
Diabetic Retinopathy AI Validation Across Multiple Clinical Sites
Participants: 26
End Date: 21.05.27
Dataset: d7lff7pn
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 50.00 PF
Submits: 0/5

About this use case: An ophthalmic institute's diabetic retinopathy AI scores 94% on its own protocol — and has no data on what happens when the fundus camera comes from a different manufacturer or the patient demographics shift. tracebloc validates competing systems on the institute's real clinical scan distribution, including dark, bright, and low-contrast images, without a single patient scan leaving the institute's infrastructure. Explore the data, submit your own AI system, and see how your approach compares.
The ophthalmic institute's diabetic retinopathy AI system scores 94% diagnostic accuracy on its own imaging protocol. What it cannot tell Dr. Elena Martin, Clinical AI Lead, is how it performs when the fundus camera comes from a different manufacturer, the acquisition settings change, or the patient demographics differ from the institute's screening population. Before applying for CE mark and deploying to partner clinics, she needs that answer — and she needs it without shipping patient scans across institutional boundaries.
Elena deploys a tracebloc workspace loaded with 2,497 anonymised retinal images covering the institute's patient population, including the full range of imaging conditions observed in clinical practice. Partner clinics and AI system contributors submit their retinal disease classification models to the workspace. Inside tracebloc's containerised training environment, each AI system trains on the dataset — fine-tuning its weights to the specific image characteristics, brightness distributions, and pathology patterns in this patient population — without any patient scan leaving the institute's infrastructure. This is a federated learning application of clinical AI validation: the patient scans stay on Elena's infrastructure from start to finish. tracebloc orchestrates evaluation, scores each system against the holdout set, and publishes results to a live leaderboard.
In this example validation, the AI system with the most modest vendor claims delivered the strongest diagnostic accuracy after adaptation to the institute's patient scans — while the highest-claiming vendor showed the sharpest collapse on real clinical image quality. The leaderboard made that gap impossible to miss before any deployment decision was made. The workspace stays active for ongoing validation as new imaging equipment enters partner clinics and new AI systems enter the market.
Elena's team manages a diabetic eye screening programme serving a population across multiple clinic locations. The AI-assisted grading workflow — where the system flags fundus images for human review based on disease probability — handles several thousand patient scans per month. The internal AI system, trained on images from the institute's primary fundus camera, performs reliably in that controlled environment. The problem is that "reliably" is not a property that transfers automatically across imaging hardware.
Diabetic retinopathy AI is acutely sensitive to acquisition conditions. Fundus cameras from different manufacturers produce images with systematically different brightness profiles, colour channel distributions, and contrast characteristics. A system trained primarily on images from one camera manufacturer can lose several percentage points of sensitivity when deployed on another — not because the pathology looks different, but because the pixel-level distribution has shifted. The institute's dataset reflects this reality: 26.4% of images are classified as dark, 4.2% as bright, and 125 images are flagged as low-contrast, representing the full range of quality that arrives from clinical practice.
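How such dark, bright, and low-contrast flags might be derived is straightforward to sketch. The thresholds and preprocessing below are illustrative assumptions, not the criteria used to build this dataset:

```python
import numpy as np
from PIL import Image

# Illustrative thresholds only; the dataset's actual flagging criteria
# are not documented in this use case.
DARK_MEAN = 60          # mean greyscale intensity below this -> "dark"
BRIGHT_MEAN = 180       # mean greyscale intensity above this -> "bright"
LOW_CONTRAST_STD = 25   # intensity standard deviation below this -> "low contrast"

def quality_flags(path: str) -> dict:
    """Return simple brightness/contrast flags for a single fundus image."""
    grey = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    mean, std = float(grey.mean()), float(grey.std())
    return {
        "mean_intensity": mean,
        "contrast_std": std,
        "dark": mean < DARK_MEAN,
        "bright": mean > BRIGHT_MEAN,
        "low_contrast": std < LOW_CONTRAST_STD,
    }

# Example: quality_flags("scans/patient_0001.png")
```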
The CE mark application requires documented evidence that the AI system generalises across the populations and imaging environments it will be deployed in. That means validation data from partner clinics — not just the institute's own protocol. The data governance constraint is immediate: partner clinics will not share patient scans, and the institute cannot require them to. Any multi-site validation that requires centralising retinal images will take months of ethics committee approvals and data sharing agreements, if it gets approved at all.
Elena needs to know which AI systems maintain sensitivity and specificity across imaging conditions before she commits to a deployment partner — and she needs that answer on real patient scan distributions, not vendor slide decks.
The validation dataset contains 2,497 anonymised retinal images at 512×512 RGB resolution. Full dataset statistics, brightness distributions, image quality flags, and class analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to reflect the statistical structure of real-world fundus photography datasets — including the disease prevalence, image quality variation, and acquisition artefacts observed in clinical screening programmes — without containing any identifiable patient data.
| Property | Value |
|---|---|
| Total images | 2,497 |
| Image dimensions | 512×512 RGB |
| Classes | Binary — normal / disease |
| Normal (label 0) | 1,336 images (53.5%) |
| Disease-positive (label 1) | 1,161 images (46.5%) |
| Class balance | Approximately balanced |
| Missing values | None |
Image quality distribution:
| Quality category | Count | Share |
|---|---|---|
| Normal brightness | 1,665 | 66.7% |
| Dark images | 659 | 26.4% |
| Bright images | 105 | 4.2% |
| Low-contrast images | 125 | 5.0% |
| Very dark (quality flag) | 35 | 1.4% |
| Very bright (quality flag) | 32 | 1.3% |
The image quality distribution is preserved exactly as observed in clinical practice. An AI system that performs at 95% accuracy on textbook-quality bright images but degrades to 85% on dark or low-contrast scans is not ready for deployment across clinic sites with varying imaging hardware. The quality breakdown in this dataset makes that degradation visible before deployment.
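One way to make that degradation visible is to report accuracy per quality bucket rather than a single headline number. A minimal sketch, assuming predictions, labels, and a per-image quality category are already available as arrays (the category names are illustrative):

```python
import numpy as np

def accuracy_by_quality(y_true, y_pred, quality):
    """Break overall accuracy down by image quality category.

    y_true, y_pred : arrays of 0/1 labels and predictions
    quality        : array of category strings, e.g. "normal", "dark",
                     "bright", "low_contrast"
    """
    y_true, y_pred, quality = map(np.asarray, (y_true, y_pred, quality))
    return {
        str(bucket): float((y_true[quality == bucket] == y_pred[quality == bucket]).mean())
        for bucket in np.unique(quality)
    }

# Example output: {"bright": 0.95, "dark": 0.88, "low_contrast": 0.85, "normal": 0.94}
```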
Each contributor submitted their AI system to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each AI system was benchmarked on the institute's patient scan distribution as submitted, with no adaptation. This establishes the baseline: what the system actually delivers on this camera mix and patient population before any fine-tuning to local conditions.
Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their AI system into tracebloc and ran training on the retinal image dataset. This training process fine-tuned the model weights to the specific image characteristics, brightness profiles, and pathology presentations in this patient population — adapting from a generalised retinal classifier to a system calibrated for real clinical imaging variance. After training, the adapted system was evaluated automatically against the holdout set. Patient scans never left the institute's infrastructure. Contributors received only their own results back; no contributor had visibility into another's training runs or scores before the leaderboard published.
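tracebloc does not prescribe how contributors train; the sketch below shows what a Phase 2 fine-tuning run could look like inside the workspace. The backbone, hyperparameters, and the `/data/train` directory layout are assumptions for illustration, not a tracebloc API:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed layout inside the workspace: /data/train/<class>/*.png with one
# subfolder per class (normal, disease).
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("/data/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

# Start from a generic pretrained backbone and replace the head for
# binary normal / disease classification.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # assumed epoch budget
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# The adapted weights never leave the workspace; tracebloc scores the model
# against the holdout set and publishes the result to the leaderboard.
```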
→ View the full model leaderboard — complete system rankings, sensitivity/specificity breakdown, and image quality robustness across all submissions.
| Vendor | Claimed Accuracy | Out-of-the-Box | After Fine-tuning | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Vendor A | 96.0% | 91.2% | 93.8% | 91.4% | 96.3% |
| Vendor B ✅ | 93.5% | 89.5% | 95.4% | 94.8% | 96.0% |
| Vendor C ⚠️ | 95.0% | 87.1% | 90.2% | 86.7% | 93.8% |
What the numbers reveal:
Vendor B shows the largest improvement from fine-tuning — nearly six percentage points, from 89.5% to 95.4% overall accuracy. More critically, it delivers the strongest sensitivity in the evaluation at 94.8%, meaning it misses the fewest disease-positive patient scans. For a diabetic screening programme, sensitivity is the number that determines patient outcomes, not headline accuracy.
Vendor A had the strongest out-of-the-box performance and the highest claimed accuracy. After fine-tuning it reaches 93.8% — a solid result, but 1.6 points below Vendor B on overall accuracy and 3.4 points below on sensitivity. At screening volume, that gap represents a meaningful difference in undetected cases.
Vendor C had the largest gap between claimed and actual performance. Its 95% claimed accuracy degraded to 87.1% on the institute's real patient scan distribution before fine-tuning — the sharpest performance collapse in the evaluation. After adapting to local conditions it recovers to 90.2%, but with sensitivity of 86.7% it would be responsible for the most missed disease cases of any system evaluated.
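Sensitivity and specificity in the leaderboard follow their standard definitions over the holdout predictions. A minimal sketch of the computation, with 1 = disease-positive (variable names are illustrative):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = recall on disease-positive scans; specificity = recall on normal scans."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # disease caught
    fn = np.sum((y_true == 1) & (y_pred == 0))  # disease missed
    tn = np.sum((y_true == 0) & (y_pred == 0))  # normal correctly cleared
    fp = np.sum((y_true == 0) & (y_pred == 1))  # normal flagged for review
    return tp / (tp + fn), tn / (tn + fp)
```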
Illustrative assumptions:
- 30,000 patient scans graded per year
- Disease prevalence: 8% (2,400 disease-positive cases)
- Cost per missed diagnosis (delayed treatment, vision loss risk, follow-up): €4,500
- Human grader cost per scan not auto-triaged: €12
| Strategy | Sensitivity | Missed Cases | Missed Diagnosis Cost | AI Cost (p.a.) | Manual Review Cost | Total Annual Cost |
|---|---|---|---|---|---|---|
| Internal baseline | 88% | 288 | €1,296,000 | — | €216,000 | €1,512,000 |
| Vendor A | 91.4% | 206 | €927,000 | €120,000 | €180,000 | €1,227,000 |
| Vendor B ✅ | 94.8% | 125 | €562,500 | €200,000 | €144,000 | €906,500 |
| Vendor C ⚠️ | 86.7% | 319 | €1,435,500 | €80,000 | €240,000 | €1,755,500 |
Vendor B reduces total annual cost from €1,512,000 (internal baseline) to €906,500 — a saving of over €600,000 per year — while carrying the highest sensitivity in the evaluation and meeting the CE mark evidence requirements Elena needs for multi-site deployment.
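The table follows directly from the stated assumptions. A short sketch that reproduces the arithmetic, using only the illustrative figures above (AI licence and manual review costs are taken as given per strategy):

```python
def annual_cost(sensitivity, ai_cost, manual_review_cost,
                scans_per_year=30_000, prevalence=0.08, cost_per_missed=4_500):
    """Missed-diagnosis cost plus AI licence cost plus residual manual grading cost."""
    positives = scans_per_year * prevalence          # 2,400 disease-positive cases
    missed = round(positives * (1 - sensitivity))    # cases the AI fails to flag
    return missed, missed * cost_per_missed + ai_cost + manual_review_cost

# Vendor B: sensitivity 94.8%, €200,000 AI cost, €144,000 residual manual review
# -> (125, 906500)
print(annual_cost(0.948, 200_000, 144_000))
```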
Elena selects Vendor B for CE mark submission and staged partner clinic deployment, starting with one partner site alongside the existing grading workflow. Three months of shadow operation validates that the 94.8% sensitivity holds across the partner clinic's imaging hardware, that performance on dark and low-contrast scans does not degrade below the thresholds documented in the regulatory submission, and that the system's latency meets the grading workflow's throughput requirements before any autonomous grading decisions go live.
The tracebloc workspace stays active after the initial validation. As partner clinics onboard new imaging equipment, as the patient population shifts, and as new AI systems enter the diabetic retinopathy AI market, Elena can re-validate without rebuilding the evaluation infrastructure or re-opening ethics committee discussions. The leaderboard becomes a live record of which AI systems maintain diagnostic accuracy across imaging conditions — turning a one-off regulatory submission into ongoing clinical AI governance.
Explore this use case further:
Related use cases: See how the same generalisation validation approach applies to heart disease prediction across partner hospitals and AI breast cancer screening across imaging protocols. For a broader view of what federated learning applications look like across clinical AI, see our federated learning applications guide.
Deploy your workspace or schedule a call.
Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world fundus photography datasets, including disease prevalence, image quality variation, and acquisition artefacts observed in clinical screening programmes, without containing any identifiable patient data. The persona, vendor names, claimed performance figures, business impact assumptions, and clinical scenario are illustrative and based on patterns observed across ophthalmic AI deployment environments. They do not represent any specific organisation, product, or regulatory submission.