
Credit Card Fraud Detection — Evaluate AI Models on Real Data
| Property | Value |
|---|---|
| Participants | 5 |
| End date | 30.06.27 |
| Dataset | dhsdksqa |
| Resources | 2 CPU (8.59 GB), 1 GPU (22.49 GB) |
| Compute | 0 / 100.00 PF |
| Submits | 0 / 5 |

About this use case: Three fraud detection vendors all claim 99%+ recall — measured on datasets they curated, at class balances that don't reflect a 0.17% fraud rate on 5 billion real transactions. tracebloc benchmarks each vendor on the payment provider's actual fraud patterns inside the provider's own infrastructure, without a single transaction record ever reaching a vendor's hands. Explore the data, submit your own model, and see how your approach compares.
Three fraud detection vendors all claim 99%+ recall on their published benchmarks. Elisa Marin, Head of Fraud Analytics at a global payments provider in Frankfurt, has worked in fraud prevention long enough to know what those numbers mean: they were measured on datasets the vendors curated, with fraud patterns they selected, at a class balance that rarely reflects production reality. Her transaction mix is different. Her fraud rate is 0.17%. Her latency SLA is 40 milliseconds per transaction. She needs to know which system will actually catch her fraud — not whose brochure reads best.
Elisa deploys a tracebloc workspace loaded with 284,807 anonymised payment transactions. Each vendor submits their fraud detection model to the workspace. Inside tracebloc's containerised training environment, vendors train their model on the 227,845-record dataset — fine-tuning it to her specific transaction patterns and fraud mix — without the data ever leaving her infrastructure. tracebloc handles the training orchestration, scores each adapted model against the holdout set, and publishes results to a live leaderboard automatically. This is federated learning applied to vendor acceptance testing: the payment data stays on Elisa's infrastructure from start to finish.
In this example evaluation, one vendor exceeded its claimed recall after fine-tuning on real transaction data. Another achieved the highest raw recall — but at a false positive rate that would generate 95 million unnecessary friction events per year on a 5-billion-transaction portfolio. The €11.5 million annual difference between the top two vendors was only visible because the evaluation ran on real transaction patterns inside tracebloc, not on a vendor-selected benchmark. The workspace stays in place for continuous re-evaluation as fraud patterns evolve and new vendors enter the market.
Elisa's team manages fraud detection across a portfolio processing 5 billion transactions a year. The internal baseline — a rule-based engine layered with LightGBM models retrained quarterly on historical data — achieves 97.5% recall at a false positive rate that generates roughly 20 million friction events annually. That is the number every external vendor has to beat. Not on their terms. On Elisa's.
Payment fraud prevention has become an arms race. Fraud tactics evolve continuously — synthetic identities, account takeover patterns, cross-border velocity signals — and the internal team can only retrain so fast. Elisa wants to know whether the financial crime AI vendors now entering the market can genuinely outperform what her team has built, or whether their claims evaporate when exposed to real transaction patterns.
The procurement problem is structural. GDPR prohibits Elisa from sharing production transaction data with external vendors during evaluation. Legal will not approve it. And even if they could share data, giving vendors a sample in advance allows them to tune specifically for that evaluation — which defeats the purpose of independent testing. The only alternative is to trust vendor-provided benchmark numbers. That is no alternative at all.
Vendor A claims 98.5% recall at sub-1% FPR. Vendor B claims 99.3% at 0.6% FPR. Vendor C claims 99.5% — but their false positive rate in the fine print is 1.5%, which at 5 billion transactions per year means tens of millions more fraud alerts than Elisa's operations team can process.
She needs a way to test all three systems on her actual fraud patterns, in a production-equivalent environment, without handing over a single transaction record.
The evaluation dataset contains 284,807 anonymised payment transactions split across a training set of 227,845 records and a holdout set of 56,962 records. Full dataset statistics, class distributions, and feature analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to match the statistical structure of real-world credit card transactions — the fraud rate, the amount distribution, the feature correlation patterns — without containing any identifiable cardholder, merchant, or transaction data.
| Property | Value |
|---|---|
| Total records | 284,807 |
| Training set | 227,845 records |
| Holdout set | 56,962 records |
| Features | 31 (Time, V1–V28, Amount, Class) |
| Fraud rate — training | 0.17% (394 fraud / 227,451 non-fraud) |
| Fraud cases — holdout | 98 |
| Mean transaction amount | $88.48 |
| Max transaction amount | $25,691.16 |
| Missing values | None |
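The 80/20 split preserves the class balance in both sets. A minimal sketch of a class-stratified split, using synthetic labels in place of the real records (which never leave the workspace); with 492 fraud cases overall, a 20% stratified holdout keeps 98 of them, matching the table — total set sizes may differ from the table by a record depending on rounding:

```python
import random

# Synthetic stand-in for the Class column: 284,807 labels, 492 fraud (~0.17%)
labels = [1] * 492 + [0] * (284_807 - 492)
random.seed(0)
random.shuffle(labels)

def stratified_split(labels, holdout_frac=0.2):
    """Split indices per class so the fraud rate is preserved in both sets."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, holdout = [], []
    for idxs in by_class.values():
        n_hold = round(len(idxs) * holdout_frac)
        holdout.extend(idxs[:n_hold])
        train.extend(idxs[n_hold:])
    return train, holdout

train_idx, hold_idx = stratified_split(labels)
hold_fraud = sum(labels[i] for i in hold_idx)
print(len(train_idx), len(hold_idx), hold_fraud)  # holdout keeps 98 fraud cases
```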
A note on the features: V1 through V28 are the result of a PCA transformation. This is not an artefact of augmentation — it reflects how payment processors actually share data with research partners: principal components preserve the statistical patterns that drive detection while making it impossible to reverse-engineer individual transactions. The class imbalance (0.17% fraud) is preserved exactly. A model that predicts "non-fraud" on every transaction achieves 99.83% accuracy — which is why accuracy is not the metric that matters here.
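The arithmetic behind that last point, using the training-set counts from the table above — a trivial majority-class "model" that never flags anything:

```python
# Majority-class baseline: predict "non-fraud" on every transaction.
# Counts from the training set: 227,845 records, 394 of them fraud.
total, fraud = 227_845, 394

true_negatives = total - fraud      # every non-fraud record scored "correctly"
accuracy = true_negatives / total   # looks excellent on paper
recall = 0 / fraud                  # zero fraud cases actually caught

print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")
# → accuracy = 99.83%, recall = 0%
```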
Each vendor submitted their fraud detection model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each vendor's model was benchmarked as-submitted, with no adaptation to Elisa's transaction data. This establishes the true baseline: what the system actually delivers when installed on a new customer's environment without customisation.
Phase 2 — Fine-tuning. Vendors were given access to the training environment inside the tracebloc workspace. Each vendor transferred their model into tracebloc and ran training on the 227,845-record dataset. This training process fine-tuned the model weights to Elisa's specific fraud patterns, transaction mix, and class distribution — adapting the model from a generalised fraud detector to a system calibrated for her portfolio. After training, the adapted model was submitted automatically for evaluation against the 56,962-record holdout set. The training data never left Elisa's infrastructure. Vendors received only their own results back; no vendor had visibility into another's training runs or scores before the leaderboard published.
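The leaderboard metrics reduce to a confusion-matrix computation on the holdout predictions. A generic sketch of that scoring step (illustrative Python, not tracebloc's actual scoring code):

```python
def score_holdout(y_true, y_pred):
    """Recall and false positive rate from binary labels and predictions (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {"recall": tp / (tp + fn), "fpr": fp / (fp + tn)}

# Toy example: 4 fraud cases, 3 caught; 1 false alarm among 6 legitimate records
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = score_holdout(y_true, y_pred)
print(metrics)  # → recall 0.75, fpr ≈ 0.167
```

Recall measures the share of fraud caught; FPR measures the friction inflicted on legitimate customers. The leaderboard reports both because, as the results below show, neither is meaningful alone.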
→ View the full model leaderboard — complete vendor rankings, precision-recall curves, and latency measurements across all submissions.
| Vendor | Claimed Recall | Out-of-the-Box | After Fine-tuning | FPR | Latency |
|---|---|---|---|---|---|
| Vendor A | 98.5% | 95.6% | 96.1% | 0.7% | 35 ms |
| Vendor B ✅ | 99.3% | 97.3% | 99.5% | 0.5% | 31 ms |
| Vendor C ⚠️ | 99.5% | 97.9% | 99.6% | 1.9% | 54 ms |
What the numbers reveal:
Vendor B did something most benchmarks never surface — it improved beyond its own claimed recall after fine-tuning on Elisa's transaction data. Starting at 97.3% out-of-the-box, it reached 99.5% after training on 227,845 real-distribution records inside the tracebloc workspace, while holding its false positive rate at 0.5% and its latency at 31 ms — inside every constraint Elisa set.
Vendor C achieved the highest raw recall at 99.6%, but at 1.9% FPR — nearly four times Vendor B's false alarm rate. At 5 billion transactions per year, that 1.4-point difference generates roughly 70 million additional friction events annually. It also breached the 40 ms latency ceiling at 54 ms, disqualifying it from real-time transaction scoring regardless of its recall.
Vendor A showed the largest gap between claimed and actual performance. Its 98.5% claimed recall degraded to 95.6% on Elisa's fraud patterns before fine-tuning — the sharpest performance collapse in the evaluation. After adapting via tracebloc's training environment, it recovered to 96.1%: still below the internal baseline Elisa was trying to beat.
Illustrative assumptions: 5 billion transactions per year; 0.1% base fraud rate (5 million fraud cases); €120 average cost per missed fraud; €0.20 cost per false positive (customer service handling, friction, churn exposure).
| Strategy | Recall | Missed Frauds | False Positives | Fraud Loss | FP Cost | Licence (p.a.) | Total Annual Cost |
|---|---|---|---|---|---|---|---|
| Internal baseline | 97.5% | 125,000 | 20M | €15.0M | €4.0M | — | €19.0M |
| Vendor A | 96.1% | 195,000 | 14M | €23.4M | €2.8M | €0.5M | €26.7M |
| Vendor B ✅ | 99.5% | 25,000 | 10M | €3.0M | €2.0M | €2.5M | €7.5M |
| Vendor C ⚠️ | 99.6% | 20,000 | 95M | €2.4M | €19.0M | €0.4M | €21.8M |
Vendor B reduces total annual cost from €19.0M (internal baseline) to €7.5M — a saving of €11.5M per year — despite carrying the highest licence fee of any vendor evaluated.
Vendor C's recall number looks better than Vendor B's on a spreadsheet. Its total annual cost is three times higher, because 95 million false positives at €0.20 per event costs more than the entire fraud loss Vendor C prevents. Headline recall without FPR context is not a metric. It is a number vendors use when they know you will not test them on your own data.
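The cost table can be reproduced directly from the stated assumptions. A sketch, with false-positive counts taken from the table above rather than derived from the holdout FPRs:

```python
TXNS_PER_YEAR = 5_000_000_000
FRAUD_CASES = int(TXNS_PER_YEAR * 0.001)   # 0.1% base fraud rate → 5M cases
COST_MISSED = 120.0                        # € per missed fraud
COST_FP = 0.20                             # € per false positive

def annual_cost(recall, false_positives, licence_eur=0.0):
    """Total annual cost in € under the illustrative assumptions above."""
    missed = (1 - recall) * FRAUD_CASES
    return missed * COST_MISSED + false_positives * COST_FP + licence_eur

strategies = {
    "Internal baseline": annual_cost(0.975, 20_000_000),
    "Vendor A": annual_cost(0.961, 14_000_000, licence_eur=500_000),
    "Vendor B": annual_cost(0.995, 10_000_000, licence_eur=2_500_000),
    "Vendor C": annual_cost(0.996, 95_000_000, licence_eur=400_000),
}
for name, cost in strategies.items():
    print(f"{name}: €{cost / 1e6:.1f}M")
# → Internal baseline €19.0M, Vendor A €26.7M, Vendor B €7.5M, Vendor C €21.8M
```

Plugging in Vendor C's numbers makes the trade-off concrete: its €19.0M false-positive bill dwarfs the €0.6M of extra fraud loss it avoids relative to Vendor B.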
Elisa selects Vendor B for production deployment, starting in shadow mode alongside the internal LightGBM system across 20% of transaction volume. Three months of shadow operation gives her team confirmation that 99.5% recall holds at full throughput, that latency stays within the 40 ms SLA at scale, and that SHAP outputs meet the explainability standard required for GDPR Article 22 compliance before the override logic goes live.
The tracebloc workspace stays active after the initial evaluation. As fraud patterns shift, Vendor B releases model updates, and new vendors enter the market, Elisa can re-evaluate without rebuilding the infrastructure, without renegotiating data access with legal, and without GDPR risk. The leaderboard becomes a live record of which systems are performing and which are degrading — turning a one-off procurement decision into ongoing model risk governance.
Explore this use case further:
Related use cases: See how the same evaluation approach applies to financial time series forecasting and insurance claims classification. For a broader view of what federated learning applications look like across industries, see our federated learning applications guide.
Deploy your workspace or schedule a call with the team.
Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world payment fraud data, including class distribution, feature relationships, and transaction amount ranges, without containing any identifiable cardholder, merchant, or transaction information. The persona, vendor names, claimed performance figures, business impact assumptions, and procurement scenario are illustrative and based on patterns observed across financial services deployments. They do not represent any specific company, product, or contractual outcome.