AI Mammography Screening Across 5 Sites: Sensitivity & Specificity

Participants: 26
End Date: 31.12.26
Dataset: dxvssxwk
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 50.00 PF
Submits: 0 / 5

Overview

About this use case: A hospital radiology network has a mammography screening AI that performs well on its own patient scans — but no independent evidence of what happens when the equipment changes, the imaging protocol differs, or the patient demographics shift. tracebloc runs multi-site validation on real clinical scans at each partner institution, without a single mammogram leaving the hospital that holds it. Explore the data, run your own models, and see how your approach compares.

Problem

A mammography AI model that achieves strong results on one hospital's patient scans may perform very differently when deployed at a second hospital running different imaging equipment, or serving a patient population with a different age and demographic mix. This generalisation gap is the central challenge in clinical imaging AI certification. Before seeking CE mark approval, developers need evidence that their model holds up across sites — evidence that requires testing on patient scans held at multiple institutions. But patient scans are among the most tightly governed data categories in European healthcare. They do not travel between hospitals. The validation evidence required for regulatory submission and the governance rules protecting patient imagery are structurally in conflict.

Dr. Priya Anand, Head of AI Radiology at a university hospital network, has a mammography classification model achieving 92% sensitivity at a 5% false positive rate on her institution's patient scans. Before submitting for CE mark, she needs multi-site validation data showing that performance holds across imaging protocols and patient demographics at partner hospitals. She cannot send patient scans to her own research team, let alone include them in a regulatory submission package, without ethics clearance at each site. She needs another route.

Solution

Dr. Anand's institution deploys a tracebloc workspace seeded with 1,318 anonymised mammography scans from its own patient archive. Partner hospitals submit their mammography classification models to the workspace. Inside tracebloc's containerised training environment, each model trains on the local scan archive — fine-tuning its weights to the imaging protocol, equipment characteristics, and pathology distribution of this specific clinical environment — without any patient scan leaving the host institution's infrastructure. tracebloc handles orchestration, scores each adapted model against the 378-scan holdout set, and publishes results to a live leaderboard tracking sensitivity, specificity, and AUC. This is a federated learning application of multi-site clinical validation: the patient scans stay on each institution's infrastructure, and the validation evidence accumulates across sites without centralising a single image.

Outcome

In this example validation, the best-performing model achieved 97.8% sensitivity at a 5% false positive rate after fine-tuning on the local scan archive — exceeding the clinical threshold required for the CE mark submission. The performance gap between the leading model and the weakest submission was 8.9 percentage points on sensitivity, visible only because the evaluation ran on real patient scans inside the tracebloc workspace, not on a vendor-curated benchmark. The workspace stays active across additional partner hospital sites, generating multi-site validation evidence without repeating the ethics and governance process at each location. The leaderboard records performance across every submitted model and site.

The Operational Challenge

Dr. Anand's network processes 42,000 mammography screening examinations per year across three hospital sites. The radiologists reviewing those scans achieve a cancer detection rate of approximately 6.8 per 1,000 screened patients, with a recall rate — the proportion of scans sent for follow-up assessment — of 4.2%. Every recalled patient who turns out not to have cancer is a false positive: a biopsy, additional imaging, patient anxiety, and clinical cost. Every cancer missed in screening is a delayed diagnosis. The trade-off between sensitivity and false positive rate is the central clinical tension in breast cancer screening, and every AI system that enters this space must demonstrate it can navigate that trade-off better than the current standard.

The regulatory pathway for a CE-marked AI medical device under the EU MDR requires clinical evidence from a multi-site reader study or retrospective validation. The standard is not just that the model performed well in development — it is that the model performs well on patient populations and imaging equipment it was not developed on. That means real patient scans from partner hospitals. And that is where the pathway stalls.

Patient mammography scans in Germany, France, and the Netherlands — the three countries where Dr. Anand's partner hospitals operate — are classified as special category health data under GDPR Article 9. Data sharing between hospitals requires a legal basis beyond legitimate interest; in practice it requires a data processing agreement, ethics committee review at each participating institution, and in some cases national regulatory notification. The timeline for that process, across three countries with different data protection authority interpretations, is 12 to 18 months — longer than the development cycle of the AI system itself.

The patient population problem compounds the governance problem. Dr. Anand's own institution primarily serves a Central European patient population in a specific age bracket. Her model was developed and internally validated on that population. Whether its sensitivity holds in a French hospital serving a different demographic mix, or a Dutch hospital running a different mammography equipment vendor with different image acquisition parameters, is genuinely unknown. A model that was validated only on the developing institution's patient scans is not a CE-marked product. It is a prototype with a publication.

Mammography AI is a crowded space. The number of vendors claiming 95%+ sensitivity on their own benchmark sets has grown substantially since the EU AI Act created a commercial incentive for CE-marked AI medical devices. Claimed sensitivity numbers measured on developer-curated datasets bear no reliable relationship to sensitivity on a hospital's actual patient archive. Dr. Anand needs to know which systems hold up on her scans — and on her partner hospitals' scans — before she routes any of them into a clinical decision workflow.

Stakeholders

  • Dr. Priya Anand, Head of AI Radiology: Leads the AI validation programme. KPIs: sensitivity at fixed false positive rate, AUC across patient subgroups, multi-site generalisation evidence for regulatory submission
  • Chief Medical Officer: Responsible for patient safety — any AI system that enters the clinical decision pathway must demonstrate it does not increase missed cancer rates; MDR compliance is a board-level accountability
  • Head of Digital Pathology: Manages the imaging infrastructure and PACS integration; the AI system must run on existing hardware without requiring proprietary imaging equipment upgrades
  • Data Protection Officer: GDPR Article 9 liability for any health data processed outside the institution; validation must not create a cross-border health data transfer
  • Regulatory Affairs Lead: Building the clinical evidence package for CE mark submission; needs documented multi-site validation with auditable performance records, not marketing claims

The Underlying Dataset

The evaluation dataset contains 1,696 mammography images split across a training set of 1,318 patient scans and a holdout set of 378 patient scans. Full dataset statistics, class distributions, and image quality analysis are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to reflect the statistical structure of real-world clinical mammography datasets — the pathology distribution, the near-balanced binary classification target, and the image characteristics of DICOM-derived mammography data — without containing any identifiable patient information, clinical record linkage, or facility metadata.

Property             Value
Total images         1,696
Training set         1,318 patient scans
Holdout set          378 patient scans
Image dimensions     224×224 px
Image format         JPEG (converted from DICOM), grayscale
Classes              2 — BENIGN (normal and benign merged), MALIGNANT
Class balance        Near-balanced, stable across train and holdout
Evaluation metrics   Accuracy / sensitivity / specificity / AUC

Class distribution (training set):

Class                      Count   Share
BENIGN (normal + benign)   681     51.7%
MALIGNANT                  637     48.3%

A note on the class structure: normal and benign findings have been merged into a single BENIGN class, reflecting the clinical decision boundary that matters for screening — whether a scan requires further assessment. This is not a three-class problem where "normal" and "benign" require different clinical responses; at the screening decision point, both categories result in the same outcome: no immediate referral. The near-balanced distribution is intentional and reflects the enriched pathology mix in a curated clinical validation dataset rather than the prevalence in a population screening programme, where the malignancy rate is far lower.
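The merge described above can be expressed as a simple label mapping. A minimal sketch — the three-class source label names and the `to_screening_label` helper are illustrative assumptions, not tracebloc's actual schema:

```python
# Hypothetical three-class source labels collapsed to the binary screening
# decision: "normal" and "benign" both result in no immediate referral.
SCREENING_LABEL = {
    "normal": "BENIGN",
    "benign": "BENIGN",
    "malignant": "MALIGNANT",
}

def to_screening_label(finding: str) -> str:
    """Map a raw finding label to its screening decision class."""
    return SCREENING_LABEL[finding.lower()]

# Class shares in the 1,318-scan training set (681 BENIGN, 637 MALIGNANT).
benign_share = round(100 * 681 / (681 + 637), 1)     # 51.7
malignant_share = round(100 * 637 / (681 + 637), 1)  # 48.3
```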

How Evaluation Works

Each partner hospital submitted their mammography classification model to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Out-of-the-box performance. Each model was benchmarked as-submitted on the 378-scan holdout set, with no exposure to the host institution's scan archive. This establishes the true generalisation baseline: what each model actually delivers on patient scans from a clinical environment it was not developed on.

Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into tracebloc and ran training on the 1,318-scan archive. The training process fine-tuned the model weights to the imaging protocol, equipment characteristics, and pathology distribution specific to this clinical site. After training, the adapted model was evaluated automatically against the 378-scan holdout set. No patient scans were exported. Each contributor received only their own results; no contributor had visibility into another's training runs or scores before the leaderboard published.

Each contributor received:

  • Training access: 1,318 anonymised mammography scans (681 benign, 637 malignant) for model fine-tuning inside the workspace
  • Evaluation environment: Sandboxed execution — adapted models run against the holdout set, no scan export path available
  • Metrics tracked: Overall accuracy, sensitivity at fixed 5% false positive rate, specificity, AUC, false positive rate
  • Regulatory context: Performance records are structured for inclusion in a CE mark clinical evidence package — auditable, site-specific, and tied to a documented evaluation protocol
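The headline metric — sensitivity at a fixed 5% false positive rate — can be computed directly from raw model scores. A dependency-free sketch under stated assumptions: the function name and the toy scores are illustrative, not part of the tracebloc evaluation harness.

```python
def sensitivity_at_fpr(scores_pos, scores_neg, max_fpr=0.05):
    """Sensitivity (true positive rate) at an operating point whose
    false positive rate on the negatives does not exceed max_fpr."""
    neg_sorted = sorted(scores_neg, reverse=True)
    # Number of negatives allowed to score above the decision threshold.
    k = int(len(neg_sorted) * max_fpr)
    # Use the (k+1)-th highest negative score as the threshold: at most
    # k negatives can then score strictly above it, so FPR <= max_fpr.
    threshold = neg_sorted[k] if k < len(neg_sorted) else float("-inf")
    tp = sum(s > threshold for s in scores_pos)
    return tp / len(scores_pos)

# Toy example: 20 negative scores, 4 positive scores.
neg = [i / 10 for i in range(10)] * 2  # 0.0 ... 0.9, twice
pos = [0.95, 0.85, 0.92, 0.99]
print(sensitivity_at_fpr(pos, neg, max_fpr=0.05))  # 0.75
```

At `max_fpr=0.05`, one of the twenty negatives may sit above the threshold, which lands the threshold at 0.9; three of the four positives clear it.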

Results

→ View the full model leaderboard — complete model rankings, sensitivity-specificity curves, and AUC scores across all submissions.

Model       Claimed Sensitivity   Out-of-the-Box   After Fine-tuning   Specificity   AUC
Model A     95.0%                 87.3%            88.9%               91.2%         0.91
Model B ✅  96.0%                 91.5%            97.8%               93.6%         0.96
Model C ⚠️  97.0%                 92.1%            97.2%               85.4%         0.93

Sensitivity reported at fixed 5% false positive rate.

What the numbers reveal:

Model B did what multi-site validation is designed to surface: it responded to fine-tuning on real patient scans, finishing above its own claimed sensitivity. Starting at 91.5% out-of-the-box on this institution's scans, it reached 97.8% after training on the 1,318-scan archive — a 6.3 percentage point improvement — while holding specificity at 93.6% and achieving an AUC of 0.96. It met and exceeded the clinical threshold required for the CE mark submission.

Model C achieved 97.2% sensitivity after fine-tuning — 0.6 percentage points below Model B — but at a specificity of 85.4%, substantially lower than Model B's 93.6%. On a population screening programme processing 42,000 examinations per year, an 8.2 percentage point specificity gap generates a materially different false positive burden: hundreds of additional unnecessary follow-up assessments per year, each carrying biopsy costs and patient anxiety. A sensitivity number without its paired specificity is not a clinical performance claim. It is a number quoted when the false positive rate is not expected to be scrutinised.

Model A showed the largest gap between claimed and observed performance. Its claimed sensitivity of 95.0% degraded to 87.3% on this institution's patient scans before fine-tuning — the sharpest generalisation failure in the evaluation. After fine-tuning it recovered to 88.9%, still 8.9 percentage points behind Model B. Its AUC of 0.91 indicates a model that has not adapted to the imaging characteristics of this clinical environment, regardless of what its internal benchmark numbers showed.

Business Impact

Illustrative assumptions: 42,000 screening examinations per year / cancer prevalence 0.68% (286 cancer cases) / €85,000 average cost per missed cancer (delayed treatment) / €2,400 cost per false positive biopsy (imaging, procedure, follow-up) / false positive rate at 5% FPR threshold applied to the 41,714 non-cancer examinations

Approach               Sensitivity   Missed Cancers   False Positives   Missed Cancer Cost   FP Cost      AI Cost (p.a.)   Total Annual Cost
Radiologist baseline   92.0%         23               1,762             €1,955,000           €4,229,000   —                €6,184,000
Model A                88.9%         32               2,086             €2,720,000           €5,006,000   €130,000         €7,856,000
Model B ✅             97.8%         6                2,086             €510,000             €5,006,000   €250,000         €5,766,000
Model C ⚠️             97.2%         8                3,524             €680,000             €8,458,000   €180,000         €9,318,000

Model B reduces total annual cost from €6,184,000 (radiologist baseline) to €5,766,000 — a saving of €418,000 per year — while catching 17 more cancers annually that would otherwise be missed. Model C's sensitivity is strong but its specificity gap nearly doubles the false positive cost, producing a total annual cost 62% higher than Model B despite similar cancer detection performance. A sensitivity number that ignores what happens to the 41,714 women who do not have cancer is not a complete clinical performance number.
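The cost arithmetic behind these figures can be reproduced with a short sketch under the stated illustrative assumptions. The `annual_cost` helper and the per-thousand rounding (which matches how the table's figures are presented) are my own; the false positive counts are taken as given from the table.

```python
CANCERS = 286        # 0.68% prevalence across 42,000 examinations
COST_MISSED = 85_000 # € per missed cancer (delayed treatment)
COST_FP = 2_400      # € per false positive work-up

def annual_cost(sensitivity, false_positives, ai_cost=0):
    """Total annual cost in euros; each component is rounded to the
    nearest thousand to match the table's presentation."""
    missed = round(CANCERS * (1 - sensitivity))
    missed_cost = round(missed * COST_MISSED, -3)
    fp_cost = round(false_positives * COST_FP, -3)
    return missed_cost + fp_cost + ai_cost

baseline = annual_cost(0.920, 1_762)           # radiologist baseline
model_b  = annual_cost(0.978, 2_086, 250_000)  # Model B
print(baseline, model_b, baseline - model_b)   # 6184000 5766000 418000
```

Running the same helper for Models A and C reproduces their totals of €7,856,000 and €9,318,000 respectively.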

Decision

Dr. Anand selects Model B for a three-month reader study at the primary site, running in parallel with standard radiologist reporting on 15% of the screening workload. The reader study protocol is designed to generate the prospective clinical evidence required alongside the retrospective validation data for the CE mark submission. Results from the reader study feed back into the tracebloc workspace evaluation record, creating a documented chain of evidence from retrospective validation to prospective clinical use.

The tracebloc workspace is replicated at two partner hospital sites in France and the Netherlands as part of the multi-site validation programme. Each site runs the same evaluation protocol on their own patient scans, generating site-specific performance records without any cross-border data transfer. The aggregate results from all three sites form the multi-site clinical evidence package for the CE mark submission. The leaderboard records performance across all sites and all model submissions — turning a one-time regulatory exercise into a continuous clinical AI governance infrastructure.

Explore this use case further:

  • View the model leaderboard — full model rankings, sensitivity-specificity curves, AUC scores
  • Explore the dataset — pathology distribution, image quality analysis, class balance
  • Start training — submit your own mammography classification model to this evaluation

Related use cases: See how the same multi-site validation approach applies to retinal disease classification and radiation therapy optimisation in prostate cancer research. For a broader view of what federated learning applications look like across industries, see our federated learning applications guide.

Deploy your workspace or schedule a call.

Disclaimer

Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world clinical mammography data, including pathology distribution, class balance between benign and malignant findings, and image characteristics of DICOM-derived mammography scans, without containing any identifiable patient information, clinical record linkage, or facility metadata. The persona, hospital configurations, claimed performance figures, business impact assumptions, and regulatory scenario are illustrative and based on patterns observed across clinical AI validation and CE mark submission processes in Europe. They do not represent any specific hospital, AI product, or regulatory outcome.