
Fine-Tune LLMs on Medical Abstracts Without Uploading Patient Data
| Property | Value |
|---|---|
| Participants | 9 |
| End date | 13.05.27 |
| Dataset | d6f091aq |
| Resources | 2 CPU (8.59 GB), 1 GPU (22.49 GB) |
| Compute | 0 / 300.00 PF |
| Submits | 0 / 5 |

About this use case: A university hospital holds 14,438 annotated medical abstracts — the proprietary training corpus that would turn a generic biomedical language model into a domain-calibrated evidence screening tool — and uploading it to any external NLP API is off the table. tracebloc fine-tunes vendor models on the corpus inside the hospital's own infrastructure, so the annotation work never leaves the hospital and every candidate model is compared on the real data. Explore the data, submit your own model, and see how your approach compares.
Dr. Lena Fischer, Head of Clinical Evidence at a European university hospital, runs a team responsible for abstract screening across five disease areas: general pathological conditions, neoplasms, cardiovascular diseases, nervous system diseases, and digestive system diseases. Every week, hundreds of new publications cross her team's desk. Her internal literature review AI — trained on generic biomedical text — misses the therapeutic area terminology that actually determines whether a paper is relevant to her guidelines program. Improving it requires fine-tuning on her annotated abstract library. Uploading that corpus to any external API is out of the question.
Lena deploys a tracebloc workspace loaded with 14,438 annotated medical abstracts — 11,550 for fine-tuning, 2,888 held out for evaluation. Contributors submit their language models to the workspace. Inside tracebloc's containerised training environment, each model trains on Lena's annotated corpus — fine-tuning its weights to the specific terminology, sentence structure, and class patterns in her abstract library, adapting from a generalised language model to a system calibrated for her therapeutic areas. This is a federated learning application of private LLM fine-tuning: the abstract corpus stays on Lena's infrastructure throughout. tracebloc orchestrates training, evaluates each fine-tuned model against the holdout set, and publishes results to a live leaderboard.
In this example evaluation, the top-performing model improved its macro-F1 by ten percentage points after fine-tuning on Lena's annotated abstracts — demonstrating that generic biomedical language models adapt substantially when trained on domain-specific text. The vendor with the highest claimed performance degraded most on Lena's real abstract distribution, confirming that therapeutic area specificity is not transferable from generic benchmarks. The workspace stays active for continuous re-evaluation as new model architectures emerge and Lena's annotation corpus grows. See the live leaderboard for current rankings.
Lena's team produces clinical evidence summaries for guideline committees, tumor boards, and regulatory submissions. The evidence pipeline starts with abstract screening: sorting incoming publications into disease categories, flagging high-relevance papers for full-text review, and ensuring that nothing clinically important is missed. At current publication volumes, manual screening at full recall is not achievable without expanding headcount significantly.
The internal classifier was built three years ago on a general biomedical corpus. It handles mainstream categories reasonably well but consistently underperforms on specialty-specific language. Cardiovascular abstracts that discuss novel endpoint definitions get misfiled. Nervous system papers using the institution's specific terminology for disease severity classification slip through. Digestive system abstracts — the smallest class at 10.3% of the corpus — have recall low enough that Lena's team runs a manual check on everything the model flags as irrelevant in that category. The broader question they care about internally is systematic review AI: can a model support the full evidence synthesis workflow, not just triage?
The procurement constraint is straightforward: Lena's CISO and legal team will not approve uploading annotated clinical abstracts to any external API for fine-tuning. Several SaaS NLP providers have offered hosted fine-tuning services, but the data governance answer is the same regardless of the provider's security certifications. The annotated corpus reflects clinical judgment — which papers are relevant to which disease areas — and that judgment is itself proprietary. The only acceptable model is one that comes to the data, not the other way around.
Three NLP vendors have been shortlisted. All three claim strong macro-F1 on published biomedical benchmarks. None of those benchmarks reflect the therapeutic area specificity, annotation style, or class distribution in Lena's corpus.
The fine-tuning dataset contains 14,438 annotated medical abstracts split 80/20 into a training set of 11,550 records and a holdout set of 2,888 records. The split is stratified: class distributions are near-identical across both sets. Full dataset statistics, length distributions, and class analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to reflect the statistical structure of real-world medical abstract corpora — the class distribution, text length ranges, and linguistic complexity — without containing any identifiable patient data, study identifiers, or institution-specific clinical information.
| Property | Value |
|---|---|
| Total abstracts | 14,438 |
| Training set | 11,550 abstracts |
| Holdout set | 2,888 abstracts |
| Train/test split | 80% / 20% |
| Classes | 5 disease categories |
| Average abstract length | ~180 words (~1,229 characters) |
| Text length range | 24–556+ words |
| Missing values | None |
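A stratified 80/20 split of this kind can be sketched as follows. This is a minimal stdlib illustration using toy record ids mirroring the five class shares listed below — `stratified_split` is a hypothetical helper, not tracebloc's actual splitting code:

```python
import random
from collections import Counter

def stratified_split(records, labels, holdout_frac=0.2, seed=42):
    """Split records into train/holdout sets, preserving per-class shares."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    train, holdout = [], []
    for lab, recs in by_class.items():
        rng.shuffle(recs)
        cut = round(len(recs) * holdout_frac)  # per-class holdout size
        holdout.extend((r, lab) for r in recs[:cut])
        train.extend((r, lab) for r in recs[cut:])
    return train, holdout

# Toy corpus of 999 records with roughly the five-class distribution below
labels = [1] * 333 + [2] * 219 + [3] * 211 + [4] * 133 + [5] * 103
records = list(range(len(labels)))
train, holdout = stratified_split(records, labels)
print(len(train), len(holdout))  # ~80/20 of 999 records
print(Counter(lab for _, lab in train))
```

Splitting within each class (rather than over the pooled corpus) is what keeps the smallest class represented at the same share in both sets.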
Class distribution (training set):
| Class | Category | Share |
|---|---|---|
| 1 | General Pathological Conditions | 33.3% |
| 2 | Neoplasms | 21.9% |
| 3 | Cardiovascular Diseases | 21.1% |
| 4 | Nervous System Diseases | 13.3% |
| 5 | Digestive System Diseases | 10.3% |
The class distribution is moderately imbalanced (class standard deviation: 8.93%). General pathological conditions account for one in three abstracts; digestive system diseases account for one in ten. A model that classifies every abstract as general pathological conditions achieves 33.3% accuracy — which is why macro-F1 across all five classes is the evaluation metric, not overall accuracy.
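The majority-class failure mode is easy to demonstrate. The sketch below (pure stdlib, toy labels drawn with the training-set class shares above) shows that a classifier predicting "general pathological conditions" for everything scores 33.3% accuracy but only 0.10 macro-F1:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 is 0 for a class the model never predicts correctly
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels with the five training-set class shares (33.3% ... 10.3%)
y_true = [1] * 333 + [2] * 219 + [3] * 211 + [4] * 133 + [5] * 103
y_pred = [1] * len(y_true)  # majority-class baseline
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={acc:.3f}  macro-F1={macro_f1(y_true, y_pred, [1, 2, 3, 4, 5]):.3f}")
# → accuracy=0.333  macro-F1=0.100
```

Macro-F1 averages per-class F1 with equal weight, so the four classes the baseline never predicts drag its score down to 0.10 — exactly the behaviour the evaluation metric is meant to penalise.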
A note on text variability: Cardiovascular abstracts tend toward longer, more complex sentence structures. Nervous system abstracts are shorter but linguistically dense. Oncology (neoplasms) abstracts frequently include multi-step causal reasoning. Digestive system abstracts are highly specific. A model that performs well on average across this distribution is one that has genuinely adapted to the therapeutic area — not one that has overfit to the majority class.
Each contributor submitted their language model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each model was benchmarked as submitted, with no fine-tuning on Lena's abstract corpus. This establishes the true baseline: what the model actually delivers on clinical text from this therapeutic area before any domain adaptation.
Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into tracebloc and ran training on the 11,550-record annotated corpus. This training process fine-tuned the model weights to Lena's specific class structure, annotation conventions, and therapeutic area vocabulary — adapting from a generalised biomedical language model to a system calibrated for her evidence pipeline. After training, the adapted model was evaluated automatically against the 2,888-record holdout set. The abstract corpus never left Lena's infrastructure. Contributors received only their own results back; no contributor had visibility into another's training runs or scores before the leaderboard published.
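The Phase 2 scoring reduces to standard holdout metrics. A minimal sketch of the per-class recall computation behind the leaderboard's "Recall: Digestive" and "Recall: Nervous" columns — toy labels only, not tracebloc's actual evaluation harness:

```python
def recall_per_class(y_true, y_pred, classes):
    """Recall for each class: TP / (TP + FN)."""
    out = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        out[c] = tp / (tp + fn) if (tp + fn) else 0.0
    return out

# Toy holdout: class 5 = digestive, class 4 = nervous system
y_true = [5, 5, 5, 5, 4, 4, 1, 1]
y_pred = [5, 5, 5, 1, 4, 4, 1, 1]  # one digestive abstract misfiled as class 1
recalls = recall_per_class(y_true, y_pred, [1, 4, 5])
print(recalls)  # class 5 recall = 3/4
```

Recall on the small classes answers the operational question directly: of the digestive system abstracts in the holdout set, what fraction did the model actually surface?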
→ View the full model leaderboard — complete rankings, per-class F1 scores, and text-length robustness across all submissions.
| Vendor | Claimed Macro-F1 | Out-of-the-Box | After Fine-tuning | Recall: Digestive | Recall: Nervous |
|---|---|---|---|---|---|
| Vendor A | 88% | 81% | 87% | 79% | 82% |
| Vendor B ✅ | 90% | 79% | 89% | 86% | 88% |
| Vendor C ⚠️ | 92% | 76% | 85% | 71% | 78% |
What the numbers reveal:
Vendor B shows the sharpest performance gain from fine-tuning — ten percentage points, from 79% to 89% macro-F1. Starting with lower out-of-the-box performance than the other two vendors, it adapts most effectively to Lena's specific annotation style and class distribution after training on 11,550 domain-specific abstracts inside the tracebloc workspace. Crucially, it delivers the strongest recall on both small classes: 86% on digestive system diseases and 88% on nervous system diseases — the categories where missed evidence has the most direct clinical consequences.
Vendor C had the most aggressive claimed performance at 92% macro-F1. Out-of-the-box it delivered the worst baseline in the evaluation at 76%, suggesting the published benchmark reflects a data distribution substantially different from Lena's corpus. After fine-tuning it reaches 85% — a creditable result, but 4 points below Vendor B, with notably lower recall on the two clinically critical small classes.
Vendor A lands in the middle on overall macro-F1 but trails Vendor B on the small classes that determine whether the evidence pipeline can actually replace the manual check Lena's team currently runs on digestive system abstracts.
Illustrative assumptions:
- 50,000 abstracts screened per year
- 0.5 FTE clinical reviewer cost: €90,000 per year
- Missed-evidence cost (regulatory rework, delayed guideline update): €15,000 per missed paper
- Current miss rate on small classes: ~25%
- Abstract-level miss rate at Vendor B recall: ~12%
| Strategy | Macro-F1 | Missed Papers (est.) | Miss Cost | AI Cost (p.a.) | Reviewer FTE | Total Annual Cost |
|---|---|---|---|---|---|---|
| Internal baseline | 78% | ~2,750 | €41.3M | — | 0.5 FTE | €41.3M+ |
| Vendor A | 87% | ~1,300 | €19.5M | €80,000 | 0.3 FTE | €19.6M+ |
| Vendor B ✅ | 89% | ~1,000 | €15.0M | €150,000 | 0.25 FTE | €15.2M+ |
| Vendor C | 85% | ~1,500 | €22.5M | €100,000 | 0.35 FTE | €22.6M+ |
The miss-cost estimates are illustrative; the actual cost of a missed paper in a regulatory submission or guideline update varies significantly by context. The directional finding holds across a wide range of assumptions: Vendor B's advantage on small-class recall compounds into the categories where misses are most consequential.
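The miss-cost column follows directly from the assumptions: estimated missed papers × €15,000 per miss. A quick check (missed-paper counts taken from the table above; note the table rounds €41.25M up to €41.3M):

```python
COST_PER_MISS = 15_000  # EUR per missed paper (illustrative assumption)

# Estimated missed papers per year, from the strategy table
missed = {
    "Internal baseline": 2_750,
    "Vendor A": 1_300,
    "Vendor B": 1_000,
    "Vendor C": 1_500,
}
miss_cost = {name: n * COST_PER_MISS for name, n in missed.items()}
for name, cost in miss_cost.items():
    print(f"{name:18s} EUR {cost:>12,}")
# Internal baseline: EUR 41,250,000; Vendor B: EUR 15,000,000
```

Because the per-miss cost dwarfs the AI licence and residual reviewer costs, the ranking of total annual cost is driven almost entirely by small-class recall.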
Lena selects Vendor B for integration into the evidence pipeline, initially in parallel with the current classifier across the digestive system and nervous system disease categories. Three months of shadow operation validates that the 86% and 88% recall figures hold at full publication volume, that the model handles new writing styles from recently published journals, and that the annotation team's quality standards for the holdout evaluation translate into production accuracy.
The tracebloc workspace stays active after the initial evaluation. As Lena's annotation corpus grows and new model releases appear, the fine-tuning and evaluation cycle runs again without rebuilding infrastructure or renegotiating data governance. The leaderboard becomes a live record of which models are delivering systematic review AI capability on Lena's therapeutic areas — turning a one-off vendor selection into ongoing research paper classification governance.
Related use cases: See how the same secure training approach applies to radiation therapy optimisation in prostate cancer research and retinal disease classification across clinical sites. For a broader view of what federated learning applications look like across healthcare, see our federated learning applications guide.
Deploy your workspace or schedule a call.
Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world medical abstract corpora, including class distribution, text length ranges, and linguistic complexity across disease categories, without containing any identifiable patient data, study identifiers, or institution-specific clinical information. The persona, vendor names, claimed performance figures, business impact assumptions, and procurement scenario are illustrative and based on patterns observed across hospital and pharma research environments. They do not represent any specific organisation, product, or contractual outcome.