Credit Card Fraud Detection — Evaluate AI Models on Real Data

Participants: 5
End Date: 30.06.27
Dataset: dhsdksqa
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 100.00 PF
Submits: 0 / 5

Overview

About this use case: Three fraud detection vendors all claim 99%+ recall — measured on datasets they curated, at class balances that don't reflect a 0.17% fraud rate on 5 billion real transactions. tracebloc benchmarks each vendor on the payment provider's actual fraud patterns inside their infrastructure, without a single transaction record ever reaching a vendor's hands. Explore the data, submit your own model, and see how your approach compares.

Problem

Three fraud detection vendors all claim 99%+ recall on their published benchmarks. Elisa Marin, Head of Fraud Analytics at a global payments provider in Frankfurt, has been in fraud long enough to know what those numbers mean: they were measured on datasets the vendors curated, with fraud patterns they selected, at a class balance that rarely reflects production reality. Her transaction mix is different. Her fraud rate is 0.17%. Her latency SLA is 40 milliseconds per transaction. She needs to know which system will actually catch her fraud — not whose brochure reads best.

Solution

Elisa deploys a tracebloc workspace loaded with 284,807 anonymised payment transactions. Each vendor submits their fraud detection model to the workspace. Inside tracebloc's containerised training environment, vendors train their models on the 227,845-record training set, fine-tuning them to her specific transaction patterns and fraud mix, without the data ever leaving her infrastructure. tracebloc handles the training orchestration, scores each adapted model against the holdout set, and publishes results to a live leaderboard automatically. This is federated learning applied to vendor acceptance testing: the payment data stays on Elisa's infrastructure from start to finish.

Outcome

In this example evaluation, one vendor exceeded its claimed recall after fine-tuning on real transaction data. Another achieved the highest raw recall — but at a false positive rate that would generate 95 million unnecessary friction events per year on a 5-billion-transaction portfolio. The €11.5 million annual difference between the top two vendors was only visible because the evaluation ran on real transaction patterns inside tracebloc, not on a vendor-selected benchmark. The workspace stays in place for continuous re-evaluation as fraud patterns evolve and new vendors enter the market.

The Operational Challenge

Elisa's team manages fraud detection across a portfolio processing 5 billion transactions a year. The internal baseline — a rule-based engine layered with LightGBM models retrained quarterly on historical data — achieves 97.5% recall at a false positive rate that generates roughly 20 million friction events annually. That is the number every external vendor has to beat. Not on their terms. On Elisa's.

Payment fraud prevention has become an arms race. Fraud tactics evolve continuously — synthetic identities, account takeover patterns, cross-border velocity signals — and the internal team can only retrain so fast. Elisa wants to know whether the financial crime AI vendors now entering the market can genuinely outperform what her team has built, or whether their claims evaporate when exposed to real transaction patterns.

The procurement problem is structural. GDPR prohibits Elisa from sharing production transaction data with external vendors during evaluation. Legal will not approve it. And even if they could share data, giving vendors a sample in advance allows them to tune specifically for that evaluation — which defeats the purpose of independent testing. The only alternative is to trust vendor-provided benchmark numbers. That is no alternative at all.

Vendor A claims 98.5% recall at sub-1% FPR. Vendor B claims 99.3% at 0.6% FPR. Vendor C claims 99.5% — but their false positive rate in the fine print is 1.5%, which at 5 billion transactions per year means tens of millions more fraud alerts than Elisa's operations team can process.

She needs a way to test all three systems on her actual fraud patterns, in a production-equivalent environment, without handing over a single transaction record.

Stakeholders

  • Elisa Marin, Head of Fraud Analytics: Owns detection performance, vendor evaluation, and model risk governance. KPIs: detection rate, false positive rate, latency SLA, annual fraud loss
  • Chief Risk Officer: Regulatory exposure, capital requirements, and audit trail for all model decisions involving transaction rejection
  • Data Protection Officer: GDPR liability for any data transfer to external parties — evaluation vendors included
  • Customer Experience Lead: Every false alarm is a declined transaction, a customer call, and a potential churn event; FPR is her problem as much as Elisa's
  • Infrastructure Engineering: On-premise or VPC deployment with hard latency ceiling of 40 ms per transaction at full volume

The Underlying Dataset

The evaluation dataset contains 284,807 anonymised payment transactions split across a training set of 227,845 records and a holdout set of 56,962 records. Full dataset statistics, class distributions, and feature analysis are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to match the statistical structure of real-world credit card transactions — the fraud rate, the amount distribution, the feature correlation patterns — without containing any identifiable cardholder, merchant, or transaction data.

| Property | Value |
|---|---|
| Total records | 284,807 |
| Training set | 227,845 records |
| Holdout set | 56,962 records |
| Features | 31 (Time, V1–V28, Amount, Class) |
| Fraud rate — training | 0.17% (394 fraud / 227,451 non-fraud) |
| Fraud cases — holdout | 98 |
| Mean transaction amount | $88.48 |
| Max transaction amount | $25,691.16 |
| Missing values | None |

A note on the features: V1 through V28 are the result of a PCA transformation. This is not an artefact of augmentation — it reflects how payment processors actually share data with research partners: principal components preserve the statistical patterns that drive detection while making it impossible to reverse-engineer individual transactions. The class imbalance (0.17% fraud) is preserved exactly. A model that predicts "non-fraud" on every transaction achieves 99.83% accuracy — which is why accuracy is not the metric that matters here.
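The accuracy trap described above is easy to verify with arithmetic. The sketch below uses the training-set counts from this use case (394 fraud cases, 227,451 non-fraud) to score a degenerate model that flags nothing as fraud:

```python
# Why accuracy is meaningless at a 0.17% fraud rate.
# Counts taken from this use case's training set.
fraud, non_fraud = 394, 227_451
total = fraud + non_fraud  # 227,845 records

# A degenerate model that predicts "non-fraud" for every transaction:
true_negatives = non_fraud   # every legitimate transaction scored "correctly"
caught_fraud = 0             # every fraud case missed

accuracy = true_negatives / total
recall = caught_fraud / fraud

print(f"Accuracy: {accuracy:.2%}")  # ~99.83% — looks excellent
print(f"Recall:   {recall:.2%}")    # 0.00% — catches nothing
```

This is why the evaluation tracks recall and false positive rate rather than accuracy.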

How Evaluation Works

Each vendor submitted their fraud detection model to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Out-of-the-box performance. Each vendor's model was benchmarked as-submitted, with no adaptation to Elisa's transaction data. This establishes the true baseline: what the system actually delivers when deployed in a new customer's environment without customisation.

Phase 2 — Fine-tuning. Vendors were given access to the training environment inside the tracebloc workspace. Each vendor transferred their model into tracebloc and ran training on the 227,845-record dataset. This training process fine-tuned the model weights to Elisa's specific fraud patterns, transaction mix, and class distribution — adapting the model from a generalised fraud detector to a system calibrated for her portfolio. After training, the adapted model was submitted automatically for evaluation against the 56,962-record holdout set. The training data never left Elisa's infrastructure. Vendors received only their own results back; no vendor had visibility into another's training runs or scores before the leaderboard published.

Each vendor received:

  • Training access: 227,845 anonymised transactions (394 fraud cases, 0.17% fraud rate) for model fine-tuning inside the workspace
  • Evaluation environment: Sandboxed execution — adapted models run against the holdout set, no data export path available
  • Metrics tracked: Recall (fraud detection rate), false positive rate, inference latency per transaction (ms), structured SHAP explainability output for rejected transactions
  • Edge case set: 5% of the holdout includes synthetic adversarial patterns — low-value fraud, round-number amounts, cross-region velocity anomalies
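The two headline metrics tracked per submission, recall and false positive rate, reduce to confusion-matrix counts over the holdout labels. A minimal sketch, with toy placeholder predictions standing in for real vendor output:

```python
# Sketch: recall and FPR from holdout predictions (0 = non-fraud, 1 = fraud).
# The prediction lists below are toy placeholders, not real vendor output.

def recall_and_fpr(y_true, y_pred):
    """Return (recall, false positive rate) from binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr

# Toy holdout: 4 fraud cases among 10 transactions
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
r, f = recall_and_fpr(y_true, y_pred)
print(f"recall={r:.2f}  fpr={f:.2f}")  # recall=0.75  fpr=0.17
```

In the actual workspace these numbers are computed server-side against the 56,962-record holdout set; vendors never see the labels, only their scores.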

Results

→ View the full model leaderboard — complete vendor rankings, precision-recall curves, and latency measurements across all submissions.

| Vendor | Claimed Recall | Out-of-the-Box | After Fine-tuning | FPR | Latency |
|---|---|---|---|---|---|
| Vendor A | 98.5% | 95.6% | 96.1% | 0.7% | 35 ms |
| Vendor B ✅ | 99.3% | 97.3% | 99.5% | 0.5% | 31 ms |
| Vendor C ⚠️ | 99.5% | 97.9% | 99.6% | 1.9% | 54 ms |

What the numbers reveal:

Vendor B did something most benchmarks never surface — it improved beyond its own claimed recall after fine-tuning on Elisa's transaction data. Starting at 97.3% out-of-the-box, it reached 99.5% after training on 227,845 real-distribution records inside the tracebloc workspace, while holding its false positive rate at 0.5% and its latency at 31 ms — inside every constraint Elisa set.

Vendor C achieved the highest raw recall at 99.6%, but at 1.9% FPR — nearly four times Vendor B's false alarm rate. At 5 billion transactions per year, that difference generates over 70 million additional friction events annually. It also breached the 40 ms latency ceiling at 54 ms, disqualifying it from real-time transaction scoring regardless of its recall.

Vendor A showed the largest gap between claimed and actual performance. Its 98.5% claimed recall degraded to 95.6% on Elisa's fraud patterns before fine-tuning — the sharpest performance collapse in the evaluation. After adapting via tracebloc's training environment, it recovered to 96.1%: still below the internal baseline Elisa was trying to beat.

Business Impact

Illustrative assumptions:

  • 5 billion transactions per year
  • 0.1% base fraud rate (5 million fraud cases)
  • €120 average cost per missed fraud
  • €0.20 cost per false positive (customer service handling, friction, churn exposure)

| Strategy | Recall | Missed Frauds | False Positives | Fraud Loss | FP Cost | Licence (p.a.) | Total Annual Cost |
|---|---|---|---|---|---|---|---|
| Internal baseline | 97.5% | 125,000 | 20M | €15.0M | €4.0M | — | €19.0M |
| Vendor A | 96.1% | 195,000 | 14M | €23.4M | €2.8M | €0.5M | €26.7M |
| Vendor B ✅ | 99.5% | 25,000 | 10M | €3.0M | €2.0M | €2.5M | €7.5M |
| Vendor C ⚠️ | 99.6% | 20,000 | 95M | €2.4M | €19.0M | €0.4M | €21.8M |

Vendor B reduces total annual cost from €19.0M (internal baseline) to €7.5M — a saving of €11.5M per year — despite carrying the highest licence fee of any vendor evaluated.

Vendor C's recall number looks better than Vendor B's on a spreadsheet. Its total annual cost is three times higher, because 95 million false positives at €0.20 per event costs more than the entire fraud loss Vendor C prevents. Headline recall without FPR context is not a metric. It is a number vendors use when they know you will not test them on your own data.
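The cost comparison reduces to a simple model: missed-fraud loss plus false-positive handling plus licence fee. A sketch under the stated illustrative assumptions, with the false-positive counts and licence fees taken directly from the table rather than derived from FPR:

```python
# Sketch of the cost model behind the business-impact table.
# Assumptions are the illustrative ones stated in this use case.
TXNS = 5_000_000_000   # transactions per year
FRAUD_RATE = 0.001     # 0.1% base fraud rate -> 5M fraud cases
COST_MISS = 120.0      # € per missed fraud
COST_FP = 0.20         # € per false positive

def total_annual_cost(recall, false_positives, licence_eur):
    fraud_cases = TXNS * FRAUD_RATE
    missed = fraud_cases * (1 - recall)
    return missed * COST_MISS + false_positives * COST_FP + licence_eur

# Vendor B vs Vendor C, figures as in the table above:
vendor_b = total_annual_cost(0.995, 10_000_000, 2_500_000)
vendor_c = total_annual_cost(0.996, 95_000_000, 400_000)
print(f"Vendor B: €{vendor_b / 1e6:.1f}M")  # €7.5M
print(f"Vendor C: €{vendor_c / 1e6:.1f}M")  # €21.8M
```

Running the same function over an FPR range makes the trade-off explicit: at €0.20 per event, each additional 0.1 percentage point of FPR on 5 billion transactions adds €1M in handling cost, which is how a 0.1-point recall advantage gets swamped.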

Decision

Elisa selects Vendor B for production deployment, starting in shadow mode alongside the internal LightGBM system across 20% of transaction volume. Three months of shadow operation gives her team confirmation that 99.5% recall holds at full throughput, that latency stays under 31 ms at scale, and that SHAP outputs meet the explainability standard required for GDPR Article 22 compliance before the override logic goes live.

The tracebloc workspace stays active after the initial evaluation. As fraud patterns shift, Vendor B releases model updates, and new vendors enter the market, Elisa can re-evaluate without rebuilding the infrastructure, without renegotiating data access with legal, and without GDPR risk. The leaderboard becomes a live record of which systems are performing and which are degrading — turning a one-off procurement decision into ongoing model risk governance.

Explore this use case further:

  • View the model leaderboard — full vendor rankings, precision-recall curves, latency measurements
  • Explore the dataset — transaction distribution, fraud rate analysis, feature behaviour
  • Start training — submit your own fraud detection model to this evaluation

Related use cases: See how the same evaluation approach applies to financial time series forecasting and insurance claims classification. For a broader view of what federated learning applications look like across industries, see our federated learning applications guide.

Deploy your workspace or schedule a call with the team.

Disclaimer

The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world payment fraud data, including class distribution, feature relationships, and transaction amount ranges, without containing any identifiable cardholder, merchant, or transaction information. The persona, vendor names, claimed performance figures, business impact assumptions, and procurement scenario are illustrative and based on patterns observed across financial services deployments. They do not represent any specific company, product, or contractual outcome.