
Fine-Tune LLMs on Medical Abstracts Without Uploading Patient Data
| Property | Value |
|---|---|
| Participants | 9 |
| End date | 13.05.27 |
| Dataset | d6f091aq |
| Resources | 2 CPU (8.59 GB), 1 GPU (22.49 GB) |
| Compute | 0 / 300.00 PF |
| Submits | 0 / 5 |

About this use case: A university hospital holds 14,438 annotated medical abstracts — the proprietary training corpus that would turn a generic biomedical language model into a domain-calibrated evidence screening tool — and uploading it to any external NLP API is off the table. tracebloc fine-tunes vendor models on the corpus inside the hospital's own infrastructure, so the annotation work never leaves the hospital and every candidate model is compared on the real data. Explore the data, submit your own model, and see how your approach compares.
Dr. Lena Fischer, Head of Clinical Evidence at a European university hospital, runs a team responsible for abstract screening across five disease areas: general pathological conditions, neoplasms, cardiovascular diseases, nervous system diseases, and digestive system diseases. Every week, hundreds of new publications cross her team's desk. Her internal literature review AI — trained on generic biomedical text — misses the therapeutic area terminology that actually determines whether a paper is relevant to her guidelines program. Improving it requires fine-tuning on her annotated abstract library. Uploading that corpus to any external API is out of the question.
Lena deploys a tracebloc workspace loaded with 14,438 annotated medical abstracts — 11,550 for fine-tuning, 2,888 held out for evaluation. Contributors submit their language models to the workspace. Inside tracebloc's containerised training environment, each model trains on Lena's annotated corpus — fine-tuning its weights to the specific terminology, sentence structure, and class patterns in her abstract library, adapting from a generalised language model to a system calibrated for her therapeutic areas. This is a federated learning application of private LLM fine-tuning: the abstract corpus stays on Lena's infrastructure throughout. tracebloc orchestrates training, evaluates each fine-tuned model against the holdout set, and publishes results to a live leaderboard.
In this example evaluation, the top-performing model improved its macro-F1 by ten percentage points after fine-tuning on Lena's annotated abstracts — demonstrating that generic biomedical language models adapt substantially when trained on domain-specific text. The vendor with the highest claimed performance degraded most on Lena's real abstract distribution, confirming that therapeutic area specificity is not transferable from generic benchmarks. The workspace stays active for continuous re-evaluation as new model architectures emerge and Lena's annotation corpus grows. See the live leaderboard for current rankings.
Lena's team produces clinical evidence summaries for guideline committees, tumor boards, and regulatory submissions. The evidence pipeline starts with abstract screening: sorting incoming publications into disease categories, flagging high-relevance papers for full-text review, and ensuring that nothing clinically important is missed. At current publication volumes, manual screening at full recall is not achievable without expanding headcount significantly.
The internal classifier was built three years ago on a general biomedical corpus. It handles mainstream categories reasonably well but consistently underperforms on specialty-specific language. Cardiovascular abstracts that discuss novel endpoint definitions get misfiled. Nervous system papers using the institution's specific terminology for disease severity classification slip through. Digestive system abstracts — the smallest class at 10.3% of the corpus — have recall low enough that Lena's team runs a manual check on everything the model flags as irrelevant in that category. The broader question they care about internally is systematic review AI: can a model support the full evidence synthesis workflow, not just triage?
The procurement constraint is straightforward: Lena's CISO and legal team will not approve uploading annotated clinical abstracts to any external API for fine-tuning. Several SaaS NLP providers have offered hosted fine-tuning services, but the data governance answer is the same regardless of the provider's security certifications. The annotated corpus reflects clinical judgment — which papers are relevant to which disease areas — and that judgment is itself proprietary. The only acceptable model is one that comes to the data, not the other way around.
Three NLP vendors have been shortlisted. All three claim strong macro-F1 on published biomedical benchmarks. None of those benchmarks reflect the therapeutic area specificity, annotation style, or class distribution in Lena's corpus.
The fine-tuning dataset contains 14,438 annotated medical abstracts split 80/20 into a training set of 11,550 records and a holdout set of 2,888 records. The split is stratified: class distributions are near-identical across both sets. Full dataset statistics, length distributions, and class analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to reflect the statistical structure of real-world medical abstract corpora — the class distribution, text length ranges, and linguistic complexity — without containing any identifiable patient data, study identifiers, or institution-specific clinical information.
| Property | Value |
|---|---|
| Total abstracts | 14,438 |
| Training set | 11,550 abstracts |
| Holdout set | 2,888 abstracts |
| Train/test split | 80% / 20% |
| Classes | 5 disease categories |
| Average abstract length | ~180 words (~1,229 characters) |
| Text length range | 24–556+ words |
| Missing values | None |
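A stratified 80/20 split of this kind can be sketched as follows. This is a minimal stdlib illustration using toy record ids mirroring the five class shares listed below — `stratified_split` is a hypothetical helper, not tracebloc's actual splitting code:

```python
import random
from collections import Counter

def stratified_split(records, labels, holdout_frac=0.2, seed=42):
    """Split records into train/holdout sets, preserving per-class shares."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    train, holdout = [], []
    for lab, recs in by_class.items():
        rng.shuffle(recs)
        cut = round(len(recs) * holdout_frac)  # per-class holdout size
        holdout.extend((r, lab) for r in recs[:cut])
        train.extend((r, lab) for r in recs[cut:])
    return train, holdout

# Toy corpus of 999 records with roughly the five-class distribution below
labels = [1] * 333 + [2] * 219 + [3] * 211 + [4] * 133 + [5] * 103
records = list(range(len(labels)))
train, holdout = stratified_split(records, labels)
print(len(train), len(holdout))  # ~80/20 of 999 records
print(Counter(lab for _, lab in train))
```

Splitting within each class (rather than over the pooled corpus) is what keeps the smallest class represented at the same share in both sets.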
Class distribution (training set):
| Class | Category | Share |
|---|---|---|
| 1 | General Pathological Conditions | 33.3% |
| 2 | Neoplasms | 21.9% |
| 3 | Cardiovascular Diseases | 21.1% |
| 4 | Nervous System Diseases | 13.3% |
| 5 | Digestive System Diseases | 10.3% |
The class distribution is moderately imbalanced (class standard deviation: 8.93%). General pathological conditions account for one in three abstracts; digestive system diseases account for one in ten. A model that classifies every abstract as general pathological conditions achieves 33.3% accuracy — which is why macro-F1 across all five classes is the evaluation metric, not overall accuracy.
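The majority-class failure mode is easy to demonstrate. The sketch below (pure stdlib, toy labels drawn with the training-set class shares above) shows that a classifier predicting "general pathological conditions" for everything scores 33.3% accuracy but only 0.10 macro-F1:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 is 0 for a class the model never predicts correctly
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels with the five training-set class shares (33.3% ... 10.3%)
y_true = [1] * 333 + [2] * 219 + [3] * 211 + [4] * 133 + [5] * 103
y_pred = [1] * len(y_true)  # majority-class baseline
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={acc:.3f}  macro-F1={macro_f1(y_true, y_pred, [1, 2, 3, 4, 5]):.3f}")
# → accuracy=0.333  macro-F1=0.100
```

Macro-F1 averages per-class F1 with equal weight, so the four classes the baseline never predicts drag its score down to 0.10 — exactly the behaviour the evaluation metric is meant to penalise.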
A note on text variability: Cardiovascular abstracts tend toward longer, more complex sentence structures. Nervous system abstracts are shorter but linguistically dense. Oncology (neoplasms) abstracts frequently include multi-step causal reasoning. Digestive system abstracts are highly specific. A model that performs well on average across this distribution is one that has genuinely adapted to the therapeutic area — not one that has overfit to the majority class.
Each contributor submitted their language model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each model was benchmarked as submitted, with no fine-tuning on Lena's abstract corpus. This establishes the true baseline: what the model actually delivers on clinical text from this therapeutic area before any domain adaptation.
Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into tracebloc and ran training on the 11,550-record annotated corpus. This training process fine-tuned the model weights to Lena's specific class structure, annotation conventions, and therapeutic area vocabulary — adapting from a generalised biomedical language model to a system calibrated for her evidence pipeline. After training, the adapted model was evaluated automatically against the 2,888-record holdout set. The abstract corpus never left Lena's infrastructure. Contributors received only their own results back; no contributor had visibility into another's training runs or scores before the leaderboard published.
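The Phase 2 scoring reduces to standard holdout metrics. A minimal sketch of the per-class recall computation behind the leaderboard's "Recall: Digestive" and "Recall: Nervous" columns — toy labels only, not tracebloc's actual evaluation harness:

```python
def recall_per_class(y_true, y_pred, classes):
    """Recall for each class: TP / (TP + FN)."""
    out = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        out[c] = tp / (tp + fn) if (tp + fn) else 0.0
    return out

# Toy holdout: class 5 = digestive, class 4 = nervous system
y_true = [5, 5, 5, 5, 4, 4, 1, 1]
y_pred = [5, 5, 5, 1, 4, 4, 1, 1]  # one digestive abstract misfiled as class 1
recalls = recall_per_class(y_true, y_pred, [1, 4, 5])
print(recalls)  # class 5 recall = 3/4
```

Recall on the small classes answers the operational question directly: of the digestive system abstracts in the holdout set, what fraction did the model actually surface?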
→ View the full model leaderboard — complete rankings, per-class F1 scores, and text-length robustness across all submissions.
| Vendor | Claimed Macro-F1 | Out-of-the-Box | After Fine-tuning | Recall: Digestive | Recall: Nervous |
|---|---|---|---|---|---|
| Vendor A | 88% | 81% | 87% | 79% | 82% |
| Vendor B ✅ | 90% | 79% | 89% | 86% | 88% |
| Vendor C ⚠️ | 92% | 76% | 85% | 71% | 78% |
What the numbers reveal:
Vendor B shows the sharpest performance gain from fine-tuning — ten percentage points, from 79% to 89% macro-F1. Starting with lower out-of-the-box performance than the other two vendors, it adapts most effectively to Lena's specific annotation style and class distribution after training on 11,550 domain-specific abstracts inside the tracebloc workspace. Crucially, it delivers the strongest recall on both small classes: 86% on digestive system diseases and 88% on nervous system diseases — the categories where missed evidence has the most direct clinical consequences.
Vendor C had the most aggressive claimed performance at 92% macro-F1. Out-of-the-box it delivered the worst baseline in the evaluation at 76%, suggesting the published benchmark reflects a data distribution substantially different from Lena's corpus. After fine-tuning it reaches 85% — a creditable result, but 4 points below Vendor B, with notably lower recall on the two clinically critical small classes.
Vendor A lands in the middle on overall macro-F1 but trails Vendor B on the small classes that determine whether the evidence pipeline can actually replace the manual check Lena's team currently runs on digestive system abstracts.
Illustrative assumptions:
- 50,000 abstracts screened per year
- 0.5 FTE clinical reviewer cost: €90,000 per year
- Missed-evidence cost (regulatory rework, delayed guideline update): €15,000 per missed paper
- Current miss rate on small classes: ~25%
- Abstract-level miss rate at Vendor B recall: ~12%
| Strategy | Macro-F1 | Missed Papers (est.) | Miss Cost | AI Cost (p.a.) | Reviewer FTE | Total Annual Cost |
|---|---|---|---|---|---|---|
| Internal baseline | 78% | ~2,750 | €41.3M | — | 0.5 FTE | €41.3M+ |
| Vendor A | 87% | ~1,300 | €19.5M | €80,000 | 0.3 FTE | €19.6M+ |
| Vendor B ✅ | 89% | ~1,000 | €15.0M | €150,000 | 0.25 FTE | €15.2M+ |
| Vendor C | 85% | ~1,500 | €22.5M | €100,000 | 0.35 FTE | €22.6M+ |
The miss-cost estimates are illustrative; the actual cost of a missed paper in a regulatory submission or guideline update varies significantly by context. The directional finding holds across a wide range of assumptions: Vendor B's advantage on small-class recall compounds into the categories where misses are most consequential.
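The miss-cost column follows directly from the assumptions: estimated missed papers × €15,000 per miss. A quick check (missed-paper counts taken from the table above; note the table rounds €41.25M up to €41.3M):

```python
COST_PER_MISS = 15_000  # EUR per missed paper (illustrative assumption)

# Estimated missed papers per year, from the strategy table
missed = {
    "Internal baseline": 2_750,
    "Vendor A": 1_300,
    "Vendor B": 1_000,
    "Vendor C": 1_500,
}
miss_cost = {name: n * COST_PER_MISS for name, n in missed.items()}
for name, cost in miss_cost.items():
    print(f"{name:18s} EUR {cost:>12,}")
# Internal baseline: EUR 41,250,000; Vendor B: EUR 15,000,000
```

Because the per-miss cost dwarfs the AI licence and residual reviewer costs, the ranking of total annual cost is driven almost entirely by small-class recall.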
Lena selects Vendor B for integration into the evidence pipeline, initially in parallel with the current classifier across the digestive system and nervous system disease categories. Three months of shadow operation validates that the 86% and 88% recall figures hold at full publication volume, that the model handles new writing styles from recently published journals, and that the annotation team's quality standards for the holdout evaluation translate into production accuracy.
The tracebloc workspace stays active after the initial evaluation. As Lena's annotation corpus grows and new model releases appear, the fine-tuning and evaluation cycle runs again without rebuilding infrastructure or renegotiating data governance. The leaderboard becomes a live record of which models are delivering systematic review AI capability on Lena's therapeutic areas — turning a one-off vendor selection into ongoing research paper classification governance.
Related use cases: See how the same secure training approach applies to radiation therapy optimisation in prostate cancer research and retinal disease classification across clinical sites. For a broader view of what federated learning applications look like across healthcare, see our federated learning applications guide.
Deploy your workspace or schedule a call.
Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world medical abstract corpora, including class distribution, text length ranges, and linguistic complexity across disease categories, without containing any identifiable patient data, study identifiers, or institution-specific clinical information. The persona, vendor names, claimed performance figures, business impact assumptions, and procurement scenario are illustrative and based on patterns observed across hospital and pharma research environments. They do not represent any specific organisation, product, or contractual outcome.