The tracebloc Playbook: How to Achieve Top Performance in Credit Card Fraud Detection
Tracebloc is a tool for benchmarking AI models on private data. This Playbook breaks down how one team used it to benchmark vendor models on their credit card transaction dataset and discover which one truly delivered the best results. Find out more on our website or schedule a call with the founder directly.
Why Model Performance Matters
Every missed fraud case costs money, and every false alarm frustrates users. Using tracebloc, a global payments provider uncovered which tabular data classification model truly performs under pressure, saving over €11 million a year compared to the next best option.
Step 1: The Use Case
Elisa Marin, Head of Fraud Analytics at a global payments provider in Frankfurt, is tasked with reducing financial losses due to undetected fraud while keeping false positives low enough to avoid customer churn and customer service overload.
Her team currently uses a rule-based system enhanced with LightGBM models trained on historical transaction data. It works decently on typical transaction patterns but fails in edge cases such as low-value fraud, synthetic identities, and cross-border schemes. Because fraud tactics evolve faster than the models, Elisa wants to explore external AI vendors to boost detection recall while maintaining latency and explainability targets.
Key requirements:
· Model must run under 40 ms per transaction
· Must achieve ≥99.5% recall on known fraud patterns, with ≤1% false positive rate
· Must operate on-prem or on VPC, fully GDPR-compliant
· Must explain rejection reasons in structured format (e.g. SHAP or rule-traceback)
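To make the explainability requirement concrete, here is a minimal sketch, assuming the internal LightGBM baseline, of how per-transaction rejection reasons could be exported as structured SHAP-style contributions. The feature names and helper function are illustrative assumptions, not part of the provider's actual system.

```python
# Illustrative sketch only: structured rejection reasons from a LightGBM model
# via SHAP-style feature contributions (pred_contrib). Feature names, the
# trained booster, and the helper are assumptions, not the provider's code.
import numpy as np
import lightgbm as lgb

FEATURE_NAMES = ["amount", "merchant_risk", "country_mismatch", "velocity_1h"]  # assumed

def rejection_reasons(booster: lgb.Booster, x_row: np.ndarray, top_k: int = 3) -> dict:
    """Return the top-k feature contributions pushing one transaction toward 'fraud'."""
    # pred_contrib=True returns one SHAP value per feature plus the expected value
    # (all in raw log-odds space) as the last column.
    contrib = booster.predict(x_row.reshape(1, -1), pred_contrib=True)[0]
    shap_values, base_value = contrib[:-1], contrib[-1]
    top = np.argsort(-np.abs(shap_values))[:top_k]
    return {
        "base_score": float(base_value),
        "reasons": [
            {"feature": FEATURE_NAMES[i], "contribution": float(shap_values[i])}
            for i in top
        ],
    }
```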
The company has access to 300 million historical transactions labeled as "fraud" or "non-fraud", enriched with transactional and behavioural metadata.
While regular retraining in-house is established, Elisa initiates an evaluation of external fraud detection vendors to tap into the market and test what's possible. Her goal: validate whether vendors can outperform the internal baseline without compromising latency, regulatory traceability, or user experience.
Step 2: What the Vendors Claimed
| Vendor | Claimed Recall / FPR | Model License Cost p.a. | Deployment Model |
| --- | --- | --- | --- |
| A | 98.5% / 0.8% | €0.5M | API, EU-hosted SaaS |
| B | 99.3% / 0.6% | €2.5M | On-prem (Docker) |
| C | 99.5% / 1.5% | €0.4M | GPU-accelerated VPC |
While all vendors promised high recall, Vendor C's false positive rate exceeded internal limits. Vendor B offered explainable scores and full deployment flexibility, which made it the strongest initial candidate for internal review.
Step 3: Secure Evaluation and Fine-Tuning
Using tracebloc's secure benchmarking framework, Elisa’s team sets up isolated evaluation environments where vendors never access raw transaction data directly. Instead, vendors are allowed to fine-tune on feature embeddings and feedback signals inside a protected container environment.
Each vendor receives the same:
· 100 million synthesized transactions for training
· 20 million held-out records for final evaluation
· Benchmarks on latency, recall, false positive rate, and explanation completeness
· Bonus: 5% of the test set contains adversarial edge cases from real incidents
After fine-tuning, vendors submit updated models or scoring pipelines for re-evaluation.
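As a rough illustration of what such a re-evaluation might compute, the following sketch scores the held-out records inside the isolated environment and checks them against the agreed benchmarks. `score_fn` is a stand-in for a vendor's submitted scoring pipeline; the metric choices are assumptions, not tracebloc's actual interface.

```python
# Hypothetical evaluation harness: score the held-out set and report recall,
# false positive rate, and average latency. A sketch of the idea only.
import time
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(score_fn, X: np.ndarray, y_true: np.ndarray, threshold: float = 0.5) -> dict:
    start = time.perf_counter()
    scores = score_fn(X)                      # vendor pipeline returns fraud scores in [0, 1]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    y_pred = (scores >= threshold).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "avg_latency_ms_per_txn": elapsed_ms / len(X),  # batch average, not a per-call p99
    }
```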
Step 4: Observed Results After Testing and Fine-Tuning
| Vendor | Claimed Recall | Benchmark Recall | Recall After Fine-Tuning | FPR | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| A | 98.5% | 95.6% | 96.1% | 0.7% | 35 |
| B ✅ | 99.3% | 97.3% | 99.5% | 0.5% | 31 |
| C | 99.5% | 97.9% | 99.6% | 1.9% | 54 ⚠️ |
Vendor B not only matched but exceeded its claimed recall, reaching 99.5% while keeping false positives and latency well within acceptable limits. Vendor C achieved the highest recall overall at 99.6%, but its high false positive rate and 54 ms latency violate production constraints for real-time scoring. Vendor A showed modest improvement but failed to meet the internal benchmark for edge-case detection.
Only Vendor B ticks all boxes after fine-tuning: recall at the 99.5% target, a 0.5% false positive rate, and 31 ms latency, comfortably within the 40 ms budget.
Step 5: Business Case – Cost of False Positives and Missed Fraud
Assumptions:
- Annual volume: 5 billion transactions
- Historical fraud rate: 0.1% → 5 million cases
- Average cost per missed fraud: €120
- Cost per false positive: €0.20 (support call, customer dissatisfaction, friction)
| Strategy | Latency (ms) | Recall | Missed Frauds | False Positives | Fraud Cost | FP Cost | Model License Cost p.a. | Total Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Internal Only | 30 | 97.5% | 125,000 | 20M | €15M | €4M | €0 | €19M |
| Vendor A | 35 | 96.1% | 195,000 | 17M | €23.4M | €3.4M | €0.5M | €27.3M |
| Vendor B ✅ | 31 | 99.5% | 25,000 | 10M | €3.0M | €2.0M | €2.5M | €7.5M |
| Vendor C | 54 ⚠️ | 99.6% | 20,000 | 95M | €2.4M | €19M ⚠️ | €0.4M | €21.8M |
Vendor B cuts missed fraud from 125,000 to 25,000 cases a year compared with the internal baseline and lowers total annual cost from €19M to €7.5M, a saving of more than €11M, while keeping false positives and infrastructure load in check.
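The arithmetic behind the table is straightforward; a short worked example (figures taken from the assumptions above, monetary values in euros):

```python
# Worked example of the cost model above. Inputs come from the assumptions and
# the post-fine-tuning benchmark results; this simply reproduces the table rows.
ANNUAL_TXNS = 5_000_000_000
FRAUD_CASES = int(ANNUAL_TXNS * 0.001)        # 0.1% fraud rate -> 5,000,000 cases
COST_PER_MISSED_FRAUD = 120.0                 # EUR
COST_PER_FALSE_POSITIVE = 0.20                # EUR

def total_cost(recall: float, false_positives: int, license_cost: float) -> float:
    missed = FRAUD_CASES * (1.0 - recall)
    return (missed * COST_PER_MISSED_FRAUD
            + false_positives * COST_PER_FALSE_POSITIVE
            + license_cost)

# Vendor B after fine-tuning: 99.5% recall, 10M false positives, €2.5M license
print(total_cost(0.995, 10_000_000, 2_500_000))   # ≈ 7,500,000  -> €7.5M
# Internal baseline: 97.5% recall, 20M false positives, no license cost
print(total_cost(0.975, 20_000_000, 0))           # ≈ 19,000,000 -> €19M
```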
Step 6: Decision – Production Rollout with Controlled Ramp-Up
Elisa opts for a hybrid strategy, where Vendor B’s model is deployed in shadow mode on 20% of transactions, gradually expanding if performance holds.
Next steps:
- Deploy Vendor B’s containerized model in the on-prem fraud scoring engine
- Monitor in parallel with internal model for 3 months
- Activate override logic only for high-confidence fraud scores
This hybrid approach gives the payments provider high accuracy and flexibility without risking user trust or SLA violations.
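A simplified sketch of how such a shadow-mode ramp-up with a confidence-gated override could be wired; the 20% share and the threshold mirror the plan above, while the model objects and field names are purely illustrative:

```python
# Illustrative only: shadow-mode scoring with a high-confidence override.
# The internal model stays authoritative; Vendor B's model scores ~20% of
# traffic in parallel, is logged for the 3-month comparison, and may override
# the decision only above a high-confidence threshold (value assumed here).
import random

SHADOW_SHARE = 0.20          # fraction of transactions also scored by Vendor B
OVERRIDE_THRESHOLD = 0.98    # override only for very high-confidence fraud scores

def score_transaction(txn: dict, internal_model, vendor_model, audit_log: list) -> bool:
    internal_score = internal_model.score(txn)
    is_fraud = internal_score >= 0.5                 # internal decision by default

    if random.random() < SHADOW_SHARE:
        vendor_score = vendor_model.score(txn)
        audit_log.append({                           # kept for the parallel monitoring
            "txn_id": txn["id"],
            "internal_score": internal_score,
            "vendor_score": vendor_score,
        })
        if vendor_score >= OVERRIDE_THRESHOLD:
            is_fraud = True                          # override logic for clear fraud
    return is_fraud
```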
Disclaimer:
The names, figures, and benchmarks above are fictionalized for illustrative purposes only. They represent a plausible but simplified view of how modern fraud detection models are evaluated and integrated into enterprise environments.