Enhancing Breast Cancer Screening with Secure AI Model Evaluation
Use Case Description
Dr. Elena Fischer, Lead Data Scientist at a major university hospital in Berlin, is tasked with enhancing diagnostic quality in breast cancer screening: missed tumours lead to delayed treatment and higher mortality, while false positives drive unnecessary biopsies and costs.
Radiologists at the hospital currently achieve ~96% sensitivity at a clinically acceptable ~5% false positive rate, but manual review is time-consuming and subject to fatigue. The goal is to provide an AI assistant that supports radiologists, especially in the pre-diagnostic phase, by helping to prioritize scans and reduce false negatives.
Key requirements:
- The model must run on-premises or close to the edge
- It must reach ≥97.5% sensitivity in real-world testing at a fixed false positive rate (FPR) below 5%
- It must integrate with PACS and respect all GDPR and MDR regulations
The hospital has access to a valuable internal dataset: 50,000 annotated X-ray scans (malignant/benign). While developing an in-house model remains a potential path, estimated to take 12 to 18 months, Elena decides to first evaluate the current state of the market. By conducting a structured assessment of three CE-marked AI vendors, she aims to understand the current state of the art, explore available commercial solutions, and gain a clearer picture of the associated costs, performance, and integration effort. This approach also offers access to pretrained models that may generalize better and exhibit less institutional bias than a model trained solely on internal data.
Step 1: Vendor Model Metrics
Each vendor submits commercial and technical proposals:
| Vendor | Claimed Sensitivity / Specificity | Cost per Image | Infrastructure Load |
|---|---|---|---|
| A | 95% / 94% | €0.50 | Low |
| B | 96% / 95% | €1.00 | Moderate |
| C | 97% / 96% | €1.80 | High |
Elena and her team prioritize sensitivity due to the high risk of missing early tumours, but they cap the false positive rate at 5% to avoid excessive misdiagnoses.
Step 2: Secure Evaluation and Fine-Tuning
Using tracebloc, Elena sets up secure sandboxes within the hospital’s IT infrastructure. Vendors do not get access to the raw data. Instead, they are invited to fine-tune their models securely on-prem, ensuring full data protection and regulatory compliance.
Each vendor is given access to:
- 40,000 images for training
- 10,000 held-out images for testing
- Predefined benchmark metrics: AUC, sensitivity, specificity, and inference time (see the evaluation sketch below)
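To make the benchmark concrete, here is a minimal sketch of how sensitivity at a fixed 5% FPR can be computed on the held-out set, assuming each vendor model outputs a per-image malignancy score. The arrays and helper function are illustrative placeholders, not tracebloc's actual API:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def sensitivity_at_fpr(y_true, y_score, max_fpr=0.05):
    """Sensitivity (TPR) at the best operating point whose FPR stays <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    within_budget = fpr <= max_fpr
    best = np.argmax(tpr[within_budget])  # TPR is non-decreasing along the curve
    return tpr[within_budget][best], thresholds[within_budget][best]

# Placeholder labels/scores standing in for the 10,000 held-out test images:
# y_true = 1 for malignant, 0 for benign; y_score = model's malignancy score.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 10_000)
y_score = np.clip(y_true * 0.4 + rng.random(10_000) * 0.6, 0, 1)  # toy scores

auc = roc_auc_score(y_true, y_score)
sens, threshold = sensitivity_at_fpr(y_true, y_score, max_fpr=0.05)
print(f"AUC {auc:.3f} | sensitivity @ 5% FPR {sens:.3f} | threshold {threshold:.3f}")
```

At the chosen operating point, specificity is simply 1 − FPR, so capping the FPR at 5% guarantees at least 95% specificity before sensitivities are compared; inference time would be measured separately inside the sandbox.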
After baseline evaluation, vendors fine-tune their models and submit new versions for testing. The results reveal a dramatic gap between claimed performance and real-world performance on the hospital's proprietary data.
Observed Results After Testing
| Vendor | Claimed Sensitivity | Baseline Sensitivity @ 5% FPR | Sensitivity @ 5% FPR After Fine-Tuning |
|---|---|---|---|
| A | 0.95 | 0.87 | 0.89 |
| B | 0.96 | 0.91 | 0.961 |
| C | 0.97 ✅ | 0.92 ✅ | 0.978 ✅ |
Surprise outcome: Vendor A fell well short of its claimed sensitivity even after fine-tuning. Vendor B matched its claim after fine-tuning but was outperformed by Vendor C, which improved markedly through secure fine-tuning and was the only model to clear the ≥97.5% sensitivity requirement.
Step 3: Business Case
Assumptions:
- Annual volume: 50,000 X-ray scans
- Cancer prevalence: 0.5% (≈250 cancers per year)
- Human sensitivity baseline: 96% (a 4% miss rate), i.e. 10 missed cancers and ~2,500 false positives per year
- Cost per missed cancer: €100,000 (delayed treatment)
- Cost per false positive: €2,000 (biopsy, imaging, stress)
- AI usage: full batch inference with radiologist-in-the-loop (the sketch after this list reproduces the arithmetic)
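The table below follows directly from these assumptions and the per-image prices in Step 1. A back-of-the-envelope sketch of the arithmetic, rounding missed cancers up conservatively:

```python
import math

SCANS_PER_YEAR = 50_000
PREVALENCE = 0.005         # 0.5% -> ~250 cancers per year
COST_PER_MISSED = 100_000  # EUR per missed cancer (delayed treatment)

def annual_cost(sensitivity, price_per_image=0.0):
    cancers = SCANS_PER_YEAR * PREVALENCE
    # Round missed cancers up (conservatively); the epsilon absorbs float noise.
    missed = math.ceil(cancers * (1 - sensitivity) - 1e-9)
    total = missed * COST_PER_MISSED + SCANS_PER_YEAR * price_per_image
    return missed, total

for name, sens, price in [
    ("Human Only", 0.960, 0.00),
    ("Vendor A",   0.890, 0.50),
    ("Vendor B",   0.961, 1.00),
    ("Vendor C",   0.978, 1.80),
]:
    missed, total = annual_cost(sens, price)
    print(f"{name:10s}  missed ≈ {missed:2d}  total ≈ €{total:,.0f}")
```

Note that at the Step 1 list prices, annual inference costs come to €25,000 for Vendor A (50,000 × €0.50), €50,000 for Vendor B (50,000 × €1.00), and €90,000 for Vendor C (50,000 × €1.80).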
| Approach | Estimated Sensitivity @ 5% FPR | Missed Cancers | Cost due to Missed Cancers | AI Cost | Total Cost* |
|---|---|---|---|---|---|
| Human Only | ~0.960 | 10 | €1,000,000 | €0 | €1,000,000 |
| Vendor A | 0.890 | 28 | €2,800,000 | €25,000 | €2,825,000 |
| Vendor B | 0.961 | 10 | €1,000,000 | €50,000 | €1,050,000 |
| Vendor C | 0.978 | ~6 | €600,000 | €90,000 | €690,000 |
* False positive costs (~2,500 cases × €2,000 ≈ €5M/year) are held constant across all approaches at the fixed 5% FPR, so the comparison focuses solely on the impact of missed cancers.
Step 4: Vendor Selection and Strategy
After secure on-prem fine-tuning, Vendor C's model achieved the highest sensitivity (0.978). The best-performing setup emerged from a hybrid strategy in which the AI model pre-sorts scans and flags high-risk cases for radiologists to review. This setup reduces diagnostic errors by over 60%, improves efficiency, and maintains full clinical oversight: radiologists stay in control at every step.
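A rough sketch of what that pre-sorting could look like in practice; the threshold value and data structures are illustrative assumptions, not the deployed PACS integration:

```python
from dataclasses import dataclass

# Operating threshold frozen on the held-out set so that FPR stays below 5%;
# the value is a placeholder, not Vendor C's actual calibration.
OPERATING_THRESHOLD = 0.31

@dataclass
class Scan:
    study_id: str
    risk_score: float  # model's malignancy score in [0, 1]

def triage(worklist):
    """Reorder the reading worklist: flagged high-risk scans first.
    Every scan is still read by a radiologist; only the order changes."""
    ranked = sorted(worklist, key=lambda s: s.risk_score, reverse=True)
    flagged = [s for s in ranked if s.risk_score >= OPERATING_THRESHOLD]
    routine = [s for s in ranked if s.risk_score < OPERATING_THRESHOLD]
    return flagged + routine

queue = triage([Scan("MG-001", 0.82), Scan("MG-002", 0.07), Scan("MG-003", 0.44)])
print([(s.study_id, s.risk_score) for s in queue])
```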
The hospital selects the Human + AI configuration, which delivers both strong medical outcomes and a compelling business case:
| Strategy | Estimated Sensitivity | Missed Cancers | Cost due to Missed Cancers | Total Annual Cost |
|---|---|---|---|---|
| Human Only | ~0.960 | 10 | €1,000,000 | €1,000,000 |
| Human + Vendor C | ~0.985 | ~4 | €400,000 | €490,000 ✅ |
Estimated annual savings: €510,000 (€1,000,000 − €490,000)
The hospital proceeds with a 3-month pilot, including:
- On-prem deployment and PACS integration
- Radiologist-in-the-loop configuration
- Monthly performance validation and audit reporting (see the monitoring sketch below)
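The monthly validation step can be scripted as a re-check of the frozen operating point against the month's confirmed outcomes. The targets below mirror this case study's requirements; the helper and field names are illustrative:

```python
import numpy as np

SENSITIVITY_FLOOR = 0.975  # contractual target from the requirements
FPR_CEILING = 0.05

def monthly_audit(y_true, y_flagged):
    """Recompute sensitivity and FPR on confirmed outcomes at the frozen
    operating threshold, and flag any breach of the agreed targets."""
    y_true, y_flagged = np.asarray(y_true), np.asarray(y_flagged)
    tp = np.sum((y_flagged == 1) & (y_true == 1))
    fn = np.sum((y_flagged == 0) & (y_true == 1))
    fp = np.sum((y_flagged == 1) & (y_true == 0))
    tn = np.sum((y_flagged == 0) & (y_true == 0))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return {
        "sensitivity": round(float(sensitivity), 3),
        "fpr": round(float(fpr), 3),
        "within_targets": bool(sensitivity >= SENSITIVITY_FLOOR and fpr <= FPR_CEILING),
    }

# y_flagged: model flags at the frozen threshold; y_true: confirmed outcomes.
print(monthly_audit(y_true=[1, 1, 0, 0, 0], y_flagged=[1, 1, 1, 0, 0]))
```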
Disclaimer:
The persona, figures, performance metrics, and cost calculations in this case study are illustrative and based on fictionalized inputs designed to mimic real-world scenarios. They are intentionally kept at a high level to make the concepts easier to understand and communicate. These do not represent actual clinical results, vendor performance, or contractual terms, and are intended solely for strategic discussion and conceptual exploration.