
Enhancing Credit Card Fraud Detection with Secure AI Model Evaluation
Overview

Tracebloc is a tool for benchmarking AI models on private data. This Playbook breaks down how a fraud detection team used tracebloc on their credit card transaction dataset and discovered which model truly delivered the best results. Find out more on our website or schedule a call with the founder directly.
Every missed fraud case costs money, and every false alarm frustrates users. Using tracebloc, a fraud detection company uncovered which tabular classification model truly performs under pressure, saving over €11 million a year compared to the next best option.
Elisa Marin, Head of Fraud Analytics at a global payments provider in Frankfurt, is tasked with reducing financial losses due to undetected fraud while keeping false positives low enough to avoid customer churn and customer service overload.
Her team currently uses a rule-based system enhanced with LightGBM models trained on historical transaction data. It works decently under normal patterns but fails in edge cases like low-value frauds, synthetic identities, or cross-border patterns. As fraud tactics evolve faster than models, Elisa wants to explore external AI vendors to boost detection recall while maintaining latency and explainability targets.
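For context, the internal baseline is a standard supervised pipeline. A minimal sketch of such a LightGBM classifier, assuming a pandas DataFrame of labeled transactions (the feature names are illustrative, not the company's actual schema):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative features; real systems use hundreds of engineered signals
FEATURES = ["amount", "hour_of_day", "country_mismatch",
            "txn_count_24h", "avg_amount_30d"]

def train_baseline(df: pd.DataFrame) -> lgb.LGBMClassifier:
    """Train a LightGBM fraud classifier on labeled historical transactions."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[FEATURES], df["is_fraud"], test_size=0.2, stratify=df["is_fraud"]
    )
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        # Fraud is heavily imbalanced; weight the positive class accordingly
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50)])
    return model
```

Class weighting matters here: with fraud rates well under 1%, an unweighted model tends to minimize loss by rarely flagging anything.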
Key requirements:
- Detection recall above the internal baseline, especially on edge cases like low-value frauds, synthetic identities, and cross-border patterns
- A false positive rate low enough to avoid customer churn and customer service overload
- Real-time scoring latency
- Explainability and regulatory traceability

The company has access to 300 million historical transactions labeled as "fraud" or "non-fraud", enriched with transactional and behavioural metadata.
While regular retraining in-house is established, Elisa initiates an evaluation of external fraud detection vendors to tap into the market and test what's possible. Her goal: validate whether vendors can outperform the internal baseline without compromising latency, regulatory traceability, or user experience.
| VENDOR | CLAIMED RECALL / FPR | MODEL LICENSE p.a. | DEPLOYMENT MODEL |
| --- | --- | --- | --- |
| A | 98.5% / 0.8% | €0.5M | API, EU-hosted SaaS |
| B | 99.3% / 0.6% | €2.5M | On-prem (Docker) |
| C | 99.5% / 1.5% | €0.4M | GPU-accelerated VPC |
While all vendors promised high recall, Vendor C's false positive rate exceeded internal limits. Vendor B offered explainable scores and full deployment flexibility, which made it the strongest initial candidate for internal review.
Using tracebloc's secure benchmarking framework, Elisa’s team sets up isolated evaluation environments where vendors never access raw transaction data directly. Instead, vendors are allowed to fine-tune on feature embeddings and feedback signals inside a protected container environment.
Each vendor receives the same inputs: identical feature embeddings, feedback signals, and an isolated compute environment. After fine-tuning, vendors submit updated models or scoring pipelines for re-evaluation, as sketched below.
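The sketch below illustrates the pattern, not tracebloc's actual API; `finetune_in_container`, its arguments, and the SGD scorer are hypothetical stand-ins. The point is that the vendor model only ever sees embeddings and feedback labels inside the container:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def finetune_in_container(vendor_model: SGDClassifier,
                          embeddings: np.ndarray,
                          feedback: np.ndarray) -> SGDClassifier:
    """Incrementally update a vendor scoring model inside the protected
    environment; raw transactions never enter the container."""
    # partial_fit performs an incremental update, so feedback signals
    # can be streamed in batches as they arrive
    vendor_model.partial_fit(embeddings, feedback, classes=np.array([0, 1]))
    return vendor_model
```

Any model exposing an incremental-update interface would slot in the same way; gradient-boosted scorers would typically be retrained on the embeddings instead.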
| VENDOR | CLAIMED RECALL | BENCHMARK RECALL | RECALL AFTER FINE-TUNING | FPR | LATENCY (ms) |
| --- | --- | --- | --- | --- | --- |
| A | 98.5% | 95.6% | 96.1% | 0.7% | 35 |
| B | 99.3% | 97.3% | 99.5% | 0.5% | 31 |
| C | 99.5% | 97.9% | 99.6% | 1.9% | 54 ⚠️ |
Vendor B not only matched but exceeded its claimed recall, reaching 99.5% while keeping false positives and latency well within acceptable limits. Vendor C achieved the highest recall overall at 99.6%, but its high false positive rate and 54 ms latency violated production constraints for real-time scoring. Vendor A showed modest improvement but failed to meet the internal benchmark for edge-case detection.
Only Vendor B ticks all boxes after fine-tuning: high recall, low FPR, and sub-35 ms latency.
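The three benchmark columns reduce to simple measurements on a labeled hold-out set. A minimal sketch of how they can be computed for a submitted scorer (`scorer` is any model with a `predict_proba` method; the 0.5 threshold is an assumption):

```python
import time
import numpy as np

def benchmark_metrics(scorer, X: np.ndarray, y: np.ndarray,
                      threshold: float = 0.5) -> dict:
    """Compute recall, false positive rate, and amortized latency
    for one submitted model on a labeled hold-out set."""
    start = time.perf_counter()
    scores = scorer.predict_proba(X)[:, 1]
    # Amortized batch latency; a production SLA would track
    # single-transaction p99 latency instead
    latency_ms = (time.perf_counter() - start) / len(X) * 1000

    flagged = scores >= threshold
    fraud = y == 1
    recall = (flagged & fraud).sum() / fraud.sum()    # share of fraud caught
    fpr = (flagged & ~fraud).sum() / (~fraud).sum()   # share of legit flagged
    return {"recall": recall, "fpr": fpr, "latency_ms": latency_ms}
```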
Assumptions:
- Cost per missed fraud: €120 (consistent with the fraud cost column below, e.g. 125,000 × €120 = €15M)
- Cost per false positive: €0.20 (support call, customer dissatisfaction, friction)
| STRATEGY | LATENCY (ms) | RECALL | MISSED FRAUDS | FALSE POSITIVES | FRAUD COST | FP COST | MODEL LICENSE COST p.a. | TOTAL COST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Internal Only | 30 | 97.5% | 125,000 | 20M | €15M | €4M | 0 | €19M |
| Vendor A | 35 | 96.1% | 195,000 | 17M | €23.4M | €3.4M | €0.5M | €27.3M |
| Vendor B ✅ | 31 | 99.5% | 25,000 | 10M | €3.0M | €2.0M | €2.5M | €7.5M |
| Vendor C | 54 ⚠️ | 99.6% | 20,000 | 95M | €2.4M | €19M ⚠️ | €0.4M | €21.8M |
Vendor B shows a dramatic reduction in missed fraud while keeping false positives and infrastructure load in check.
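The totals follow directly from the two unit costs; a minimal sketch reproducing the arithmetic (strategy figures taken from the table above):

```python
COST_PER_MISSED_FRAUD = 120.0    # EUR, from the assumptions above
COST_PER_FALSE_POSITIVE = 0.20   # EUR, from the assumptions above

def annual_cost(missed_frauds: int, false_positives: int,
                license_eur: float) -> float:
    """Total yearly cost of a strategy: fraud losses + FP friction + license."""
    return (missed_frauds * COST_PER_MISSED_FRAUD
            + false_positives * COST_PER_FALSE_POSITIVE
            + license_eur)

# Vendor B: 25,000 missed frauds, 10M false positives, EUR 2.5M license
print(f"{annual_cost(25_000, 10_000_000, 2_500_000):,.0f}")  # 7,500,000
```

The same function reproduces the €19M internal baseline (125,000 missed frauds, 20M false positives, no license), putting Vendor B's saving at €11.5M a year.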
Elisa opts for a hybrid strategy, where Vendor B’s model is deployed in shadow mode on 20% of transactions, gradually expanding if performance holds (a sketch of the traffic split follows below).
Next steps: expand shadow coverage in stages while recall, FPR, and latency hold, then promote Vendor B’s model to live scoring.
This hybrid approach gives the bank high accuracy and flexibility without risking user trust or SLA violations.
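In shadow mode, the vendor model scores live traffic without affecting decisions. A minimal sketch of a deterministic 20% split, assuming hypothetical `internal_model`, `vendor_b_model`, and `log_shadow` hooks (hashing the transaction ID keeps each transaction's cohort assignment stable):

```python
import hashlib

SHADOW_FRACTION = 0.20  # share of transactions also scored by Vendor B

def in_shadow(transaction_id: str, fraction: float = SHADOW_FRACTION) -> bool:
    """Deterministically assign a transaction to the shadow cohort."""
    digest = hashlib.sha256(transaction_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction

def score_transaction(txn) -> float:
    decision_score = internal_model.score(txn)  # always drives the decision
    if in_shadow(txn.id):
        # Vendor B's score is logged for offline comparison, never acted on
        log_shadow(txn.id, vendor_b_model.score(txn))
    return decision_score
```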
Disclaimer:
The names, figures, and benchmarks above are fictionalized for illustrative purposes only. They represent a plausible but simplified view of how modern fraud detection models are evaluated and integrated into enterprise environments.