
Safety Metabolomics for Drug-Induced Liver Injury Detection
| Participants | 8 |
|---|---|
| End date | 18.02.27 |
| Dataset | dvsfxok7 |
| Resources | 2 CPU (8.59 GB) \| 1 GPU (22.49 GB) |
| Compute | 0 / 100.00 PF |
| Submits | 0 / 5 |

About this use case: A clinical institution holds 1,600 hepatotoxicity patient records that pharma safety teams need to validate their DILI models on — but GDPR compliance and re-identification risk mean a formal data transfer agreement would take six to twelve months and might never clear. tracebloc lets safety teams fine-tune and benchmark their models on this independent cohort without a single patient record crossing institutional boundaries. Explore the data, submit your own model, and see how your approach compares.
Pharmacovigilance AI teams develop DILI prediction models on their internal trial safety databases — and those models perform well on internal holdout sets. The problem surfaces at regulatory submission: a model trained on one company's patient population, one drug class, one dosing protocol, has never been tested on an independent cohort with different patient demographics, different cumulative exposure histories, and different baseline metabolic profiles. Regulatory agencies expect generalisation evidence. The clinical cost of undetected severe DILI — liver failure, drug withdrawal, patient harm — makes that expectation reasonable.
Dr. Marcus Weber, Head of Clinical Data Science at a mid-size pharmaceutical company in Basel, has built a DILI severity classifier trained on his internal safety database. He deploys a tracebloc workspace loaded with 1,600 anonymised patient records from an independent clinical institution. Partner institutions — other pharma companies, academic clinical pharmacology centres — submit their DILI prediction models to the workspace. Inside tracebloc's containerised training environment, each model trains on the 1,280-patient dataset — fine-tuning to the specific metabolomic profile, dosing patterns, and clinical lab distributions present in this independent cohort — without any patient data leaving the hospital. tracebloc handles orchestration, scores each adapted model against the holdout set, and publishes results to a live leaderboard automatically. This is a federated learning application of safety model generalisation testing: the clinical data stays on the institution's infrastructure from start to finish.
In this example evaluation, the best-performing model improved its severe DILI recall substantially after fine-tuning on the independent patient population — but the gap between claimed and actual performance before adaptation was stark across all contributors. The rare severe cases (Class 3, 75 patients) drove the separation between models: one contributor achieved strong overall accuracy while missing more than half of all severe cases, a pattern invisible in headline metrics. The leaderboard surfaces those class-level differences. The workspace remains available for ongoing model updates as the institution's safety cohort grows.
Marcus's team is preparing a safety submission for a hepatotoxic compound entering Phase III. His internal DILI model — trained on 4,200 patients from his company's safety database across three completed trials — classifies injury severity into four categories: no injury, mild, moderate, and severe. On the internal holdout set it achieves strong performance. The question regulators will ask is whether it performs similarly on patients from a different institution, with a different metabolic baseline, and different cumulative exposure patterns.
This is the generalisation question. It is not a theoretical concern. Safety models trained on one trial population routinely underperform on different patient demographics, different dosing regimes, and different concomitant medication profiles. A model that learns to associate a specific ALT elevation trajectory with severe DILI in one population may not generalise to a population where baseline liver function differs, where bile acid profiles reflect different dietary patterns, or where creatine kinase signals are confounded by different physical activity levels.
The severe DILI class is the critical one. With only 75 severe cases in 1,600 patients — roughly a 19-fold imbalance against the no-injury class — missing those cases has direct patient safety consequences. A model that optimises for overall accuracy reaches 90.3% simply by labelling every patient as no-injury. That number is meaningless for a safety submission. What matters is recall on Class 3.
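The accuracy trap is easy to reproduce. A minimal scikit-learn sketch, using the cohort's class counts (1,445 / 19 / 61 / 75), shows that a degenerate model predicting no-injury for every patient scores 90.3% accuracy while missing every severe case:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Class counts from the evaluation dataset: no injury, mild, moderate, severe.
counts = {0: 1445, 1: 19, 2: 61, 3: 75}
y_true = np.concatenate([np.full(n, c) for c, n in counts.items()])

# A degenerate model that always predicts "no injury" (class 0).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
severe_recall = recall_score(y_true, y_pred, labels=[3], average=None,
                             zero_division=0)[0]

print(f"accuracy:      {accuracy:.3f}")       # 0.903 — looks strong
print(f"macro-F1:      {macro_f1:.3f}")       # 0.237 — exposes the imbalance
print(f"severe recall: {severe_recall:.3f}")  # 0.000 — every severe case missed
```

This is why the evaluation scores macro-F1 and Class 3 recall rather than accuracy: both collapse for a model that ignores the minority classes.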
Marcus's problem is structural. The independent cohort he needs for generalisation testing sits at a clinical partner institution. The data is exactly what he needs — real hepatotoxicity patients, real metabolomic panels, real dosing records. But that institution cannot transfer patient data to his company's servers. A formal data transfer agreement would take six to twelve months and require patient consent re-collection that is practically impossible for a retrospective cohort. Without external validation, his regulatory submission rests on internal evidence alone.
He needs a way to test his model on the independent cohort — calibrating it to the external population's metabolic patterns — without a single patient record ever leaving the clinical institution.
The evaluation dataset contains 1,600 anonymised patient records with metabolomic, clinical laboratory, and dosing data. Full dataset statistics, class distributions, and feature analysis are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to reflect the statistical structure of real-world hepatotoxicity safety cohorts — the severe class rarity, the clinical lab value distributions for ALT, AST, bilirubin, and bile acids, and the correlation between cumulative dose and injury severity — without containing any identifiable patient information.
| Property | Value |
|---|---|
| Total records | 1,600 |
| Training cohort | 1,280 records |
| Holdout cohort | 320 records |
| Features | 203 columns: 201 numerical features, 1 categorical label, 1 patient_id |
| DILI severity classes | 4 |
| Missing values | None |
| Class imbalance ratio | 76× (no-injury vs. mild, the rarest class); ~19× (no-injury vs. severe) |
| Evaluation metric | Macro-F1 and Class 3 recall |
DILI severity class distribution (full dataset):
| Class | Severity | Patients | Share |
|---|---|---|---|
| 0 | No injury | 1,445 | 90.3% |
| 1 | Mild | 19 | 1.2% |
| 2 | Moderate | 61 | 3.8% |
| 3 | Severe | 75 | 4.7% |
A note on the features: The 201 numerical features span four domains. Clinical liver function markers include ALT, AST, alkaline phosphatase, GGT, and bilirubin — the standard hepatotoxicity signal panel. Metabolomic features include named lipid species (Lipids_1 through Lipids_57+) and bile acid markers (Bile_Acids_1 through Bile_Acids_6+). Dosing variables — days_on_drug, dose_mg, cumulative_dose_mg, weeks_on_treatment — are the strongest predictors of severity: mean cumulative dose rises from 704 mg in no-injury patients to 1,973 mg in severe cases. Patient demographics (age, weight_kg, diabetes status) complete the feature set. The steep class imbalance — severe cases are just 4.7% of the cohort — mirrors what a real-world hepatotoxicity surveillance cohort looks like: most patients tolerate the drug; the rare severe cases are the ones that matter clinically and regulatorily.
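As a rough illustration of how a contributor's preprocessing might handle this schema, the sketch below partitions a hypothetical column list (mirroring the prefixes and names described above; the exact feature counts and the label/ID column names are assumptions) into the four domains:

```python
# Hypothetical column list mirroring the schema described above. The real
# dataset has 201 numerical features; counts here are illustrative.
columns = (
    ["patient_id", "dili_severity"]                  # ID + label (names assumed)
    + ["ALT", "AST", "ALP", "GGT", "bilirubin"]      # clinical liver panel
    + [f"Lipids_{i}" for i in range(1, 58)]          # metabolomic lipid species
    + [f"Bile_Acids_{i}" for i in range(1, 7)]       # bile acid markers
    + ["days_on_drug", "dose_mg", "cumulative_dose_mg", "weeks_on_treatment"]
    + ["age", "weight_kg", "diabetes"]               # demographics
)

def feature_domains(cols):
    """Partition feature names into the four domains by prefix/name."""
    domains = {"clinical": [], "metabolomic": [], "dosing": [], "demographic": []}
    dosing = {"days_on_drug", "dose_mg", "cumulative_dose_mg", "weeks_on_treatment"}
    demographic = {"age", "weight_kg", "diabetes"}
    for c in cols:
        if c in ("patient_id", "dili_severity"):
            continue  # not predictive features
        if c.startswith(("Lipids_", "Bile_Acids_")):
            domains["metabolomic"].append(c)
        elif c in dosing:
            domains["dosing"].append(c)
        elif c in demographic:
            domains["demographic"].append(c)
        else:
            domains["clinical"].append(c)
    return domains

domains = feature_domains(columns)
print({k: len(v) for k, v in domains.items()})
# {'clinical': 5, 'metabolomic': 63, 'dosing': 4, 'demographic': 3}
```

Grouping features this way makes it straightforward to report domain-level importance (e.g. dosing vs. metabolomics) alongside the leaderboard metrics.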
Each contributor submitted their DILI severity prediction model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each model was scored as submitted, with no adaptation to the independent cohort's patient population. This establishes what the internally trained model actually delivers when applied to a different clinical institution's data — the generalisation test that regulatory reviewers will effectively be applying.
Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into tracebloc and ran training on the 1,280-patient cohort. This process fine-tuned the model to the specific metabolomic profiles, bile acid patterns, and cumulative dose distributions of the independent population — adapting from a generalised DILI classifier to one calibrated for this institution's patient characteristics. After training, the adapted model was evaluated automatically against the 320-record holdout. Contributors received only their own results; no contributor had visibility into another's performance before the leaderboard published.
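tracebloc's internal training API isn't shown here, but the two-phase adapt-then-score loop a contributor runs can be sketched with synthetic stand-in data — all names, shapes, and the population shift below are assumptions for illustration. The model is pre-trained on an "internal" cohort, scored out-of-the-box on the independent holdout (Phase 1), then warm-started and refit on the independent training cohort before rescoring (Phase 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def make_cohort(n, shift):
    """Synthetic cohort: 201 features, 4 imbalanced severity classes.
    `shift` offsets the feature means to mimic a population shift."""
    y = rng.choice([0, 1, 2, 3], size=n, p=[0.903, 0.012, 0.038, 0.047])
    X = rng.normal(size=(n, 201)) + shift + y[:, None] * 0.5  # class signal
    return X, y

# "Internal" pre-training cohort vs. the independent institution's cohort.
X_int, y_int = make_cohort(4200, shift=0.0)    # contributor's internal data
X_ext, y_ext = make_cohort(1280, shift=-0.8)   # independent training cohort
X_hold, y_hold = make_cohort(320, shift=-0.8)  # independent holdout

model = LogisticRegression(max_iter=500, class_weight="balanced",
                           warm_start=True)
model.fit(X_int, y_int)  # stands in for the internally pre-trained model

def severe_recall(m):
    pred = m.predict(X_hold)
    return recall_score(y_hold, pred, labels=[3], average=None,
                        zero_division=0)[0]

before = severe_recall(model)  # Phase 1: out-of-the-box generalisation test
model.fit(X_ext, y_ext)        # Phase 2: warm-started refit on the cohort
after = severe_recall(model)

print(f"severe recall before fine-tuning: {before:.2f}")
print(f"severe recall after fine-tuning:  {after:.2f}")
```

Under this (deliberately exaggerated) shift, the pre-trained model misreads severe patients entirely until it is refit on the independent distribution — the same claimed-vs-actual pattern the leaderboard below documents on the real evaluation.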
→ View the full model leaderboard — complete contributor rankings, per-class recall, and calibration plots across all submissions.
| Contributor | Claimed F1 | Out-of-the-Box F1 | Fine-tuned F1 | Severe DILI Recall |
|---|---|---|---|---|
| Contributor A | 0.82 | 0.64 | 0.71 | 0.51 |
| Contributor B ✅ | 0.79 | 0.68 | 0.81 | 0.73 |
| Contributor C ⚠️ | 0.85 | 0.72 | 0.77 | 0.44 |
What the numbers reveal:
Contributor B improved beyond its claimed macro-F1 after fine-tuning on the independent cohort, reaching 0.81 alongside the strongest severe DILI recall in the evaluation at 0.73. It started at 0.68 out-of-the-box — a meaningful generalisation gap from its claimed 0.79, but not a collapse. Fine-tuning on 1,280 real-distribution patients made it the strongest all-round performer, with its severe-case recall driven by the model's ability to pick up the cumulative-dose and ALT-trajectory signals that distinguish Class 3 from Class 2 in this population.
Contributor C had the highest claimed F1 at 0.85 and the strongest out-of-the-box performance at 0.72. After fine-tuning it reached 0.77 — a reasonable result, but its severe DILI recall of 0.44 means it misses more than half of all severe cases. On a drug in Phase III with a hepatotoxic profile, a pharmacovigilance model that detects 44% of severe liver injury events is a safety liability, not a safety tool.
Contributor A showed the largest claimed-to-actual gap: 0.82 claimed, 0.64 on the independent cohort before adaptation. After fine-tuning it recovered to 0.71, but its severe recall of 0.51 — catching roughly one in two severe cases — remains below the minimum threshold for deployment in a clinical safety monitoring context.
Illustrative assumptions:
- Phase III trial with 800 patients on a hepatotoxic compound
- Undetected severe DILI rate estimated at 4.7% of the exposed population
- Cost of one undetected severe DILI event (regulatory delay, programme review, patient harm exposure): estimated €500K–€2M per episode
- Regulatory submission delay from insufficient generalisation evidence: 12–18 months
| Strategy | Severe DILI Recall | Estimated Missed Cases (800 patients) | Regulatory Validation Evidence | Submission Risk |
|---|---|---|---|---|
| Unvalidated internal model | Unknown | Unknown | Single cohort only | High |
| Contributor A | 0.51 | ~18 missed | Partial — large generalisation gap | Moderate |
| Contributor B ✅ | 0.73 | ~10 missed | Strong — cross-cohort evidence | Low |
| Contributor C | 0.44 | ~21 missed | Partial — high severe miss rate | High |
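The "estimated missed cases" column follows directly from the stated assumptions (800 exposed patients, 4.7% severe DILI rate) and each contributor's severe recall:

```python
# Illustrative arithmetic behind the "estimated missed cases" column,
# under the stated assumptions above.
n_patients = 800
severe_rate = 0.047
expected_severe = n_patients * severe_rate  # ≈ 37.6 expected severe events

recalls = {"Contributor A": 0.51, "Contributor B": 0.73, "Contributor C": 0.44}
missed = {name: round(expected_severe * (1 - r)) for name, r in recalls.items()}
print(missed)  # {'Contributor A': 18, 'Contributor B': 10, 'Contributor C': 21}
```

At the assumed €500K–€2M per undetected episode, the roughly eight-case difference between Contributors B and C translates into a €4M–€16M exposure gap — before counting submission-delay risk.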
The regulatory value of this evaluation is not primarily financial — it is evidentiary. A generalisation result documented on an independent patient cohort, achieved through a validated federated evaluation methodology, constitutes the kind of external validation evidence that can accompany a safety biomarker submission to EMA or FDA. Contributor B provides that evidence. Running the same evaluation through a conventional data transfer would have required six to twelve months of governance, patient consent renegotiation, and legal review — and the data would have left the institution.
Marcus selects Contributor B's model for the safety submission package, with the independent cohort fine-tuning result constituting the external validation evidence required for regulatory documentation. The feature importance output from the fine-tuning run confirms that cumulative dose, ALT at peak exposure, and bile acid elevation are the dominant predictors of severe injury in the independent population — consistent with the pharmacological mechanism, which strengthens the submission narrative.
The tracebloc workspace stays active after the initial evaluation. As the institution adds patients to its retrospective cohort, or as new therapeutic programmes require DILI model validation, the same infrastructure can be reused. New model versions from Contributor B — or candidate models from other partners — can be evaluated on the same holdout set without rebuilding the evaluation pipeline. The leaderboard records which models hold up on independent data and which do not, turning a one-off regulatory requirement into ongoing pharmacovigilance model governance.
Explore this use case further:
Related use cases: See how the same generalisation validation approach applies to liquid biopsy classification in paediatric rare disease and omics biomarker panel narrowing across rare disease cohorts. For a broader view of what federated learning applications look like in preclinical toxicology and pharmacovigilance, see our guide.
Deploy your workspace or schedule a call.
Disclaimer: The dataset used in this use case is augmented — constructed to reflect the statistical structure of real-world hepatotoxicity safety cohorts, including DILI severity class distribution, clinical liver function marker distributions, metabolomic feature patterns, and dose-response relationships, without containing any identifiable patient information. The persona, contributor names, claimed performance figures, business impact assumptions, and regulatory scenario are illustrative and based on patterns observed across pharmaceutical safety and clinical data science environments. They do not represent any specific company, drug programme, institution, or regulatory submission.