
Credit Card Fraud Detection — Evaluate AI Models on Real Data
| Property | Value |
|---|---|
| Participants | 5 |
| End date | 30.06.27 |
| Dataset | dhsdksqa |
| Resources | 2 CPU (8.59 GB), 1 GPU (22.49 GB) |
| Compute | 0 / 100.00 PF |
| Submits | 0 / 5 |

About this use case: Three fraud detection vendors all claim 99%+ recall — measured on datasets they curated, at class balances that don't reflect a 0.17% fraud rate on 5 billion real transactions. tracebloc benchmarks each vendor on the payment provider's actual fraud patterns inside the provider's own infrastructure, without a single transaction record ever reaching a vendor's hands. Explore the data, submit your own model, and see how your approach compares.
Three fraud detection vendors all claim 99%+ recall on their published benchmarks. Elisa Marin, Head of Fraud Analytics at a global payments provider in Frankfurt, has worked in fraud prevention long enough to know what those numbers mean: they were measured on datasets the vendors curated, with fraud patterns they selected, at a class balance that rarely reflects production reality. Her transaction mix is different. Her fraud rate is 0.17%. Her latency SLA is 40 milliseconds per transaction. She needs to know which system will actually catch her fraud — not whose brochure reads best.
Elisa deploys a tracebloc workspace loaded with 284,807 anonymised payment transactions. Each vendor submits their fraud detection model to the workspace. Inside tracebloc's containerised training environment, vendors train their model on the 227,845-record dataset — fine-tuning it to her specific transaction patterns and fraud mix — without the data ever leaving her infrastructure. tracebloc handles the training orchestration, scores each adapted model against the holdout set, and publishes results to a live leaderboard automatically. This is federated learning applied to vendor acceptance testing: the payment data stays on Elisa's infrastructure from start to finish.
In this example evaluation, one vendor exceeded its claimed recall after fine-tuning on real transaction data. Another achieved the highest raw recall — but at a false positive rate that would generate 95 million unnecessary friction events per year on a 5-billion-transaction portfolio. The €11.5 million annual difference between the top two vendors was only visible because the evaluation ran on real transaction patterns inside tracebloc, not on a vendor-selected benchmark. The workspace stays in place for continuous re-evaluation as fraud patterns evolve and new vendors enter the market.
Elisa's team manages fraud detection across a portfolio processing 5 billion transactions a year. The internal baseline — a rule-based engine layered with LightGBM models retrained quarterly on historical data — achieves 97.5% recall at a false positive rate that generates roughly 20 million friction events annually. That is the number every external vendor has to beat. Not on their terms. On Elisa's.
Payment fraud prevention has become an arms race. Fraud tactics evolve continuously — synthetic identities, account takeover patterns, cross-border velocity signals — and the internal team can only retrain so fast. Elisa wants to know whether the financial crime AI vendors now entering the market can genuinely outperform what her team has built, or whether their claims evaporate when exposed to real transaction patterns.
The procurement problem is structural. GDPR prohibits Elisa from sharing production transaction data with external vendors during evaluation. Legal will not approve it. And even if they could share data, giving vendors a sample in advance allows them to tune specifically for that evaluation — which defeats the purpose of independent testing. The only alternative is to trust vendor-provided benchmark numbers. That is no alternative at all.
Vendor A claims 98.5% recall at sub-1% FPR. Vendor B claims 99.3% at 0.6% FPR. Vendor C claims 99.5% — but their false positive rate in the fine print is 1.5%, which at 5 billion transactions per year means tens of millions more fraud alerts than Elisa's operations team can process.
She needs a way to test all three systems on her actual fraud patterns, in a production-equivalent environment, without handing over a single transaction record.
The evaluation dataset contains 284,807 anonymised payment transactions split across a training set of 227,845 records and a holdout set of 56,962 records. Full dataset statistics, class distributions, and feature analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to match the statistical structure of real-world credit card transactions — the fraud rate, the amount distribution, the feature correlation patterns — without containing any identifiable cardholder, merchant, or transaction data.
| Property | Value |
|---|---|
| Total records | 284,807 |
| Training set | 227,845 records |
| Holdout set | 56,962 records |
| Features | 31 (Time, V1–V28, Amount, Class) |
| Fraud rate — training | 0.17% (394 fraud / 227,451 non-fraud) |
| Fraud cases — holdout | 98 |
| Mean transaction amount | $88.48 |
| Max transaction amount | $25,691.16 |
| Missing values | None |
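The 80/20 split preserves the class balance in both sets. A minimal sketch of a class-stratified split, using synthetic labels in place of the real records (which never leave the workspace); with 492 fraud cases overall, a 20% stratified holdout keeps 98 of them, matching the table — total set sizes may differ from the table by a record depending on rounding:

```python
import random

# Synthetic stand-in for the Class column: 284,807 labels, 492 fraud (~0.17%)
labels = [1] * 492 + [0] * (284_807 - 492)
random.seed(0)
random.shuffle(labels)

def stratified_split(labels, holdout_frac=0.2):
    """Split indices per class so the fraud rate is preserved in both sets."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, holdout = [], []
    for idxs in by_class.values():
        n_hold = round(len(idxs) * holdout_frac)
        holdout.extend(idxs[:n_hold])
        train.extend(idxs[n_hold:])
    return train, holdout

train_idx, hold_idx = stratified_split(labels)
hold_fraud = sum(labels[i] for i in hold_idx)
print(len(train_idx), len(hold_idx), hold_fraud)  # holdout keeps 98 fraud cases
```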
A note on the features: V1 through V28 are the result of a PCA transformation. This is not an artefact of augmentation — it reflects how payment processors actually share data with research partners: principal components preserve the statistical patterns that drive detection while making it impossible to reverse-engineer individual transactions. The class imbalance (0.17% fraud) is preserved exactly. A model that predicts "non-fraud" on every transaction achieves 99.83% accuracy — which is why accuracy is not the metric that matters here.
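The arithmetic behind that last point, using the training-set counts from the table above — a trivial majority-class "model" that never flags anything:

```python
# Majority-class baseline: predict "non-fraud" on every transaction.
# Counts from the training set: 227,845 records, 394 of them fraud.
total, fraud = 227_845, 394

true_negatives = total - fraud      # every non-fraud record scored "correctly"
accuracy = true_negatives / total   # looks excellent on paper
recall = 0 / fraud                  # zero fraud cases actually caught

print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")
# → accuracy = 99.83%, recall = 0%
```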
Each vendor submitted their fraud detection model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each vendor's model was benchmarked as-submitted, with no adaptation to Elisa's transaction data. This establishes the true baseline: what the system actually delivers when installed on a new customer's environment without customisation.
Phase 2 — Fine-tuning. Vendors were given access to the training environment inside the tracebloc workspace. Each vendor transferred their model into tracebloc and ran training on the 227,845-record dataset. This training process fine-tuned the model weights to Elisa's specific fraud patterns, transaction mix, and class distribution — adapting the model from a generalised fraud detector to a system calibrated for her portfolio. After training, the adapted model was submitted automatically for evaluation against the 56,962-record holdout set. The training data never left Elisa's infrastructure. Vendors received only their own results back; no vendor had visibility into another's training runs or scores before the leaderboard published.
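The leaderboard metrics reduce to a confusion-matrix computation on the holdout predictions. A generic sketch of that scoring step (illustrative Python, not tracebloc's actual scoring code):

```python
def score_holdout(y_true, y_pred):
    """Recall and false positive rate from binary labels and predictions (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {"recall": tp / (tp + fn), "fpr": fp / (fp + tn)}

# Toy example: 4 fraud cases, 3 caught; 1 false alarm among 6 legitimate records
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = score_holdout(y_true, y_pred)
print(metrics)  # → recall 0.75, fpr ≈ 0.167
```

Recall measures the share of fraud caught; FPR measures the friction inflicted on legitimate customers. The leaderboard reports both because, as the results below show, neither is meaningful alone.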
→ View the full model leaderboard — complete vendor rankings, precision-recall curves, and latency measurements across all submissions.
| Vendor | Claimed Recall | Out-of-the-Box | After Fine-tuning | FPR | Latency |
|---|---|---|---|---|---|
| Vendor A | 98.5% | 95.6% | 96.1% | 0.7% | 35 ms |
| Vendor B ✅ | 99.3% | 97.3% | 99.5% | 0.5% | 31 ms |
| Vendor C ⚠️ | 99.5% | 97.9% | 99.6% | 1.9% | 54 ms |
What the numbers reveal:
Vendor B did something most benchmarks never surface — it improved beyond its own claimed recall after fine-tuning on Elisa's transaction data. Starting at 97.3% out-of-the-box, it reached 99.5% after training on 227,845 real-distribution records inside the tracebloc workspace, while holding its false positive rate at 0.5% and its latency at 31 ms — inside every constraint Elisa set.
Vendor C achieved the highest raw recall at 99.6%, but at 1.9% FPR — nearly four times Vendor B's false alarm rate. At 5 billion transactions per year, that 1.4-point difference generates roughly 70 million additional friction events annually. It also breached the 40 ms latency ceiling at 54 ms, disqualifying it from real-time transaction scoring regardless of its recall.
Vendor A showed the largest gap between claimed and actual performance. Its 98.5% claimed recall degraded to 95.6% on Elisa's fraud patterns before fine-tuning — the sharpest performance collapse in the evaluation. After adapting via tracebloc's training environment, it recovered to 96.1%: still below the internal baseline Elisa was trying to beat.
Illustrative assumptions: 5 billion transactions per year; 0.1% base fraud rate (5 million fraud cases); €120 average cost per missed fraud; €0.20 cost per false positive (customer service handling, friction, churn exposure).
| Strategy | Recall | Missed Frauds | False Positives | Fraud Loss | FP Cost | Licence (p.a.) | Total Annual Cost |
|---|---|---|---|---|---|---|---|
| Internal baseline | 97.5% | 125,000 | 20M | €15.0M | €4.0M | — | €19.0M |
| Vendor A | 96.1% | 195,000 | 14M | €23.4M | €2.8M | €0.5M | €26.7M |
| Vendor B ✅ | 99.5% | 25,000 | 10M | €3.0M | €2.0M | €2.5M | €7.5M |
| Vendor C ⚠️ | 99.6% | 20,000 | 95M | €2.4M | €19.0M | €0.4M | €21.8M |
Vendor B reduces total annual cost from €19.0M (internal baseline) to €7.5M — a saving of €11.5M per year — despite carrying the highest licence fee of any vendor evaluated.
Vendor C's recall number looks better than Vendor B's on a spreadsheet. Its total annual cost is three times higher, because 95 million false positives at €0.20 per event costs more than the entire fraud loss Vendor C prevents. Headline recall without FPR context is not a metric. It is a number vendors use when they know you will not test them on your own data.
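The cost table can be reproduced directly from the stated assumptions. A sketch, with false-positive counts taken from the table above rather than derived from the holdout FPRs:

```python
TXNS_PER_YEAR = 5_000_000_000
FRAUD_CASES = int(TXNS_PER_YEAR * 0.001)   # 0.1% base fraud rate → 5M cases
COST_MISSED = 120.0                        # € per missed fraud
COST_FP = 0.20                             # € per false positive

def annual_cost(recall, false_positives, licence_eur=0.0):
    """Total annual cost in € under the illustrative assumptions above."""
    missed = (1 - recall) * FRAUD_CASES
    return missed * COST_MISSED + false_positives * COST_FP + licence_eur

strategies = {
    "Internal baseline": annual_cost(0.975, 20_000_000),
    "Vendor A": annual_cost(0.961, 14_000_000, licence_eur=500_000),
    "Vendor B": annual_cost(0.995, 10_000_000, licence_eur=2_500_000),
    "Vendor C": annual_cost(0.996, 95_000_000, licence_eur=400_000),
}
for name, cost in strategies.items():
    print(f"{name}: €{cost / 1e6:.1f}M")
# → Internal baseline €19.0M, Vendor A €26.7M, Vendor B €7.5M, Vendor C €21.8M
```

Plugging in Vendor C's numbers makes the trade-off concrete: its €19.0M false-positive bill dwarfs the €0.6M of extra fraud loss it avoids relative to Vendor B.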
Elisa selects Vendor B for production deployment, starting in shadow mode alongside the internal LightGBM system across 20% of transaction volume. Three months of shadow operation gives her team confirmation that 99.5% recall holds at full throughput, that latency stays within the 40 ms SLA at scale, and that SHAP outputs meet the explainability standard required for GDPR Article 22 compliance before the override logic goes live.
The tracebloc workspace stays active after the initial evaluation. As fraud patterns shift, Vendor B releases model updates, and new vendors enter the market, Elisa can re-evaluate without rebuilding the infrastructure, without renegotiating data access with legal, and without GDPR risk. The leaderboard becomes a live record of which systems are performing and which are degrading — turning a one-off procurement decision into ongoing model risk governance.
Explore this use case further:
Related use cases: See how the same evaluation approach applies to financial time series forecasting and insurance claims classification. For a broader view of what federated learning applications look like across industries, see our federated learning applications guide.
Deploy your workspace or schedule a call with the team.
Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world payment fraud data, including class distribution, feature relationships, and transaction amount ranges, without containing any identifiable cardholder, merchant, or transaction information. The persona, vendor names, claimed performance figures, business impact assumptions, and procurement scenario are illustrative and based on patterns observed across financial services deployments. They do not represent any specific company, product, or contractual outcome.