Safety Metabolomics for Drug-Induced Liver Injury Detection

End Date: 18.02.27
Dataset: dvsfxok7
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)

Overview

About this use case: A clinical institution holds 1,600 hepatotoxicity patient records that pharma safety teams need to validate their DILI models on — but GDPR compliance and re-identification risk mean a formal data transfer agreement would take six to twelve months and might never clear. tracebloc lets safety teams fine-tune and benchmark their models on this independent cohort without a single patient record crossing institutional boundaries. Explore the data, submit your own model, and see how your approach compares.

Problem

Pharmacovigilance AI teams develop DILI prediction models on their internal trial safety databases — and those models perform well on internal holdout sets. The problem surfaces at regulatory submission: a model trained on one company's patient population, one drug class, one dosing protocol, has never been tested on an independent cohort with different patient demographics, different cumulative exposure histories, and different baseline metabolic profiles. Regulatory agencies expect generalisation evidence. The clinical cost of undetected severe DILI — liver failure, drug withdrawal, patient harm — makes that expectation reasonable.

Solution

Dr. Marcus Weber, Head of Clinical Data Science at a mid-size pharmaceutical company in Basel, has built a DILI severity classifier trained on his internal safety database. He deploys a tracebloc workspace loaded with 1,600 anonymised patient records from an independent clinical institution. Partner institutions (other pharma companies, academic clinical pharmacology centres) submit their DILI prediction models to the workspace. Inside tracebloc's containerised training environment, each model trains on the 1,280-record training cohort, fine-tuning to the specific metabolomic profile, dosing patterns, and clinical lab distributions present in this independent cohort, without any patient data leaving the institution. tracebloc handles orchestration, scores each adapted model against the 320-record holdout set, and publishes results to a live leaderboard automatically. This applies federated learning to safety model generalisation testing: the clinical data stays on the institution's infrastructure from start to finish.

Outcome

In this example evaluation, the best-performing model improved its severe DILI recall substantially after fine-tuning on the independent patient population — but the gap between claimed and actual performance before adaptation was stark across all contributors. The rare severe cases (Class 3, 75 patients) drove the separation between models: one contributor achieved strong overall accuracy while missing more than half of all severe cases, a pattern invisible in headline metrics. The leaderboard surfaces those class-level differences. The workspace remains available for ongoing model updates as the institution's safety cohort grows.

The Operational Challenge

Marcus's team is preparing a safety submission for a hepatotoxic compound entering Phase III. His internal DILI model — trained on 4,200 patients from his company's safety database across three completed trials — classifies injury severity into four categories: no injury, mild, moderate, and severe. On the internal holdout set it achieves strong performance. The question regulators will ask is whether it performs similarly on patients from a different institution, with a different metabolic baseline, and different cumulative exposure patterns.

This is the generalisation question. It is not a theoretical concern. Safety models trained on one trial population routinely underperform on different patient demographics, different dosing regimes, and different concomitant medication profiles. A model that learns to associate a specific ALT elevation trajectory with severe DILI in one population may not generalise to a population where baseline liver function differs, where bile acid profiles reflect different dietary patterns, or where creatine kinase signals are confounded by different physical activity levels.

The severe DILI class is the critical one. With 75 severe cases in 1,600 patients, the no-injury class outnumbers the severe class roughly 19 to 1 (the 76-fold figure is the ratio of no-injury to mild, the rarest class), and missing those cases has direct patient safety consequences. A model that optimises for overall accuracy can reach 90.3% simply by classifying every patient as no-injury. That number is meaningless for a safety submission. What matters is recall on Class 3.
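The accuracy figure can be checked directly from the class counts in the severity distribution table. The sketch below scores a hypothetical baseline that labels every patient as no-injury. The variable names are illustrative assumptions, not part of any tracebloc API. It also prints the ratio of no-injury to severe and of no-injury to the rarest class for comparison.

```python
# Class counts taken from the DILI severity distribution (full dataset).
counts = {"no_injury": 1445, "mild": 19, "moderate": 61, "severe": 75}
total = sum(counts.values())

# Hypothetical baseline: predict the majority class for every patient.
majority = max(counts, key=counts.get)
accuracy = counts[majority] / total

# Recall is 1.0 for the predicted class and 0.0 for every other class.
recall = {name: 1.0 if name == majority else 0.0 for name in counts}

# Precision of the majority class equals the accuracy; F1 is 0 for the rest.
precision_majority = accuracy
f1_majority = 2 * precision_majority / (precision_majority + 1.0)
macro_f1 = f1_majority / len(counts)

ratio_vs_severe = counts["no_injury"] / counts["severe"]
ratio_vs_rarest = counts["no_injury"] / min(counts.values())

print(f"accuracy           = {accuracy:.3f}")          # 0.903
print(f"severe recall      = {recall['severe']:.2f}")  # 0.00
print(f"macro-F1           = {macro_f1:.3f}")          # 0.237
print(f"no-injury : severe = {ratio_vs_severe:.1f}")   # 19.3
print(f"no-injury : rarest = {ratio_vs_rarest:.1f}")   # 76.1
```

A baseline with 90.3% accuracy therefore has zero severe recall and a macro-F1 of about 0.24, which is why the evaluation ranks models on macro-F1 and Class 3 recall rather than accuracy.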

Marcus's problem is structural. The independent cohort he needs for generalisation testing sits at a clinical partner institution. The data is exactly what he needs — real hepatotoxicity patients, real metabolomic panels, real dosing records. But that institution cannot transfer patient data to his company's servers. A formal data transfer agreement would take six to twelve months and require patient consent re-collection that is practically impossible for a retrospective cohort. Without external validation, his regulatory submission rests on internal evidence alone.

He needs a way to test his model on the independent cohort — calibrating it to the external population's metabolic patterns — without a single patient record ever leaving the clinical institution.

Stakeholders

Dr. Marcus Weber, Head of Clinical Data Science: Owns the DILI model, its performance claims, and its regulatory submission documentation. KPIs: severe DILI recall, false positive rate on mild cases, model generalisation evidence for IND package
Chief Medical Officer: Responsible for patient safety decisions — an undetected severe DILI during Phase III carries both patient harm and programme risk
Regulatory Affairs Lead: Needs documented external validation evidence to meet EMA and FDA expectations for AI-assisted safety biomarker submissions; single-cohort evidence is increasingly insufficient
Clinical Pharmacology Partner (external institution): Holds the independent cohort; willing to support external validation but cannot share patient records under their ethics approval or institutional data governance policy
Data Governance / Legal: GDPR and research ethics compliance for any cross-institutional data use; formal agreements take months and may require patient consent re-collection

The Underlying Dataset

The evaluation dataset contains 1,600 anonymised patient records with metabolomic, clinical laboratory, and dosing data. Full dataset statistics, class distributions, and feature analysis are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to reflect the statistical structure of real-world hepatotoxicity safety cohorts — the severe class rarity, the clinical lab value distributions for ALT, AST, bilirubin, and bile acids, and the correlation between cumulative dose and injury severity — without containing any identifiable patient information.

Property	Value
Total records	1,600
Training cohort	1,280 records
Holdout cohort	320 records
Columns	203 (201 numerical features, 1 categorical label, patient_id)
DILI severity classes	4
Missing values	None
Class imbalance ratio	76× (no-injury vs. mild, the rarest class); 19× (no-injury vs. severe)
Evaluation metric	Macro-F1 and Class 3 recall

DILI severity class distribution (full dataset):

Class	Severity	Patients	Share
0	No injury	1,445	90.3%
1	Mild	19	1.2%
2	Moderate	61	3.8%
3	Severe	75	4.7%
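The source gives the 1,280 / 320 split sizes but does not say whether the split is stratified by severity class. As a rough sketch, assuming a proportional (stratified) split, the expected per-class counts can be derived with a largest-remainder allocation. The function name `proportional_allocation` is hypothetical.

```python
def proportional_allocation(counts, holdout_size):
    """Allocate holdout_size records across classes in proportion to counts."""
    total = sum(counts.values())
    # Integer share of each class, plus the remainder used for tie-breaking.
    base = {c: n * holdout_size // total for c, n in counts.items()}
    remainder = {c: n * holdout_size % total for c, n in counts.items()}
    leftover = holdout_size - sum(base.values())
    # Hand out leftover records to the classes with the largest remainders.
    for c in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        base[c] += 1
    return base

# Class counts from the severity distribution table (full dataset).
counts = {0: 1445, 1: 19, 2: 61, 3: 75}

holdout = proportional_allocation(counts, 320)
train = {c: counts[c] - holdout[c] for c in counts}

print("holdout:", holdout)  # {0: 289, 1: 4, 2: 12, 3: 15}
print("train:  ", train)    # {0: 1156, 1: 15, 2: 49, 3: 60}
```

Under this assumption the holdout would contain about 15 severe and 4 mild cases, so a single misclassified severe patient would move Class 3 recall by roughly 0.07. Per-class metrics on the rare classes are correspondingly coarse.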

A note on the features: The 201 numerical features span four domains. Clinical liver function markers include ALT, AST, alkaline phosphatase, GGT, and bilirubin, the standard hepatotoxicity signal panel. Metabolomic features include named lipid species (Lipids_1 through Lipids_57+) and bile acid markers (Bile_Acids_1 through Bile_Acids_6+). Dosing variables (days_on_drug, dose_mg, cumulative_dose_mg, weeks_on_treatment) are the strongest predictors of severity: mean cumulative dose rises from 704 mg in no-injury patients to 1,973 mg in severe cases. Patient demographics (age, weight_kg, diabetes status) complete the feature set. The class imbalance (76× between no-injury and the rarest class, about 19× between no-injury and severe) mirrors what a real-world hepatotoxicity surveillance cohort looks like: most patients tolerate the drug; the rare severe cases are the ones that matter clinically and regulatorily.

How Evaluation Works

Each contributor submitted their DILI severity prediction model to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Out-of-the-box performance. Each model was scored as submitted, with no adaptation to the independent cohort's patient population. This establishes what the internally trained model actually delivers when applied to a different clinical institution's data — the generalisation test that regulatory reviewers will effectively be applying.

Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into tracebloc and ran training on the 1,280-patient cohort. This process fine-tuned the model to the specific metabolomic profiles, bile acid patterns, and cumulative dose distributions of the independent population — adapting from a generalised DILI classifier to one calibrated for this institution's patient characteristics. After training, the adapted model was evaluated automatically against the 320-record holdout. Contributors received only their own results; no contributor had visibility into another's performance before the leaderboard published.

Each contributor received:

Training access: 1,280 anonymised patient records (201 metabolomic and clinical features, 4 severity classes at realistic distribution) for model fine-tuning inside the workspace
Evaluation environment: Sandboxed execution — adapted models run against the holdout set, no patient data export path available
Metrics tracked: Macro-F1 score, per-class recall (especially Class 2 moderate and Class 3 severe), overall accuracy, and calibration of predicted severity probabilities
Key constraint: Class 3 (severe) recall is the primary selection criterion — a model that misses severe DILI cases is not a pharmacovigilance tool
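For readers who want to reproduce the tracked metrics locally, the sketch below computes per-class recall and macro-F1 from label lists using only the standard library. It is not tracebloc's scoring code. The function names and the eight-record toy labels are made up for illustration.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, classes):
    """Return precision, recall and F1 for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    metrics = {}
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

def macro_f1(metrics):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)

# Toy labels (illustrative only): 0 = none, 1 = mild, 2 = moderate, 3 = severe.
y_true = [0, 0, 0, 0, 1, 2, 3, 3]
y_pred = [0, 0, 0, 0, 0, 2, 3, 0]

metrics = per_class_metrics(y_true, y_pred, classes=[0, 1, 2, 3])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy        = {accuracy:.2f}")              # 0.75
print(f"Class 3 recall  = {metrics[3]['recall']:.2f}")  # 0.50
print(f"macro-F1        = {macro_f1(metrics):.3f}")     # 0.617
```

In the toy example, accuracy looks acceptable at 0.75 while half of the severe cases are missed. This is the pattern the key constraint is designed to catch.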

Results

→ View the full model leaderboard — complete contributor rankings, per-class recall, and calibration plots across all submissions.

Contributor	Claimed Macro-F1	Out-of-the-Box Macro-F1	Macro-F1 After Fine-tuning	Severe DILI Recall
Contributor A	0.82	0.64	0.71	0.51
Contributor B ✅	0.79	0.68	0.81	0.73
Contributor C ⚠️	0.85	0.72	0.77	0.44

What the numbers reveal:

Contributor B improved beyond its claimed macro-F1 after fine-tuning on the independent cohort, reaching 0.81 while achieving the strongest severe DILI recall in the evaluation at 0.73. It started at 0.68 out-of-the-box, a meaningful generalisation gap from its claimed 0.79 but not a collapse. Fine-tuning on the 1,280-record training cohort brought it to the strongest all-round performance. Its severe-case recall reflects the model's ability to pick up the cumulative dose and ALT trajectory signals that distinguish Class 3 from Class 2 in this population.

Contributor C had the highest claimed F1 at 0.85 and the strongest out-of-the-box performance at 0.72. After fine-tuning it reached 0.77 — a reasonable result, but its severe DILI recall of 0.44 means it misses more than half of all severe cases. On a drug in Phase III with a hepatotoxic profile, a pharmacovigilance model that detects 44% of severe liver injury events is a safety liability, not a safety tool.

Contributor A showed the largest claimed-to-actual gap: 0.82 claimed, 0.64 on the independent cohort before adaptation. After fine-tuning it recovered to 0.71, but its severe recall of 0.51 means it catches only about one in two severe cases, which is too low for deployment in a clinical safety monitoring context.
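The comparisons above follow from simple differences over the results table. A minimal sketch, with the table values copied in and field names chosen for illustration:

```python
# Values copied from the results table; the field names are illustrative.
results = {
    "A": {"claimed": 0.82, "out_of_box": 0.64, "fine_tuned": 0.71, "severe_recall": 0.51},
    "B": {"claimed": 0.79, "out_of_box": 0.68, "fine_tuned": 0.81, "severe_recall": 0.73},
    "C": {"claimed": 0.85, "out_of_box": 0.72, "fine_tuned": 0.77, "severe_recall": 0.44},
}

# Generalisation gap: claimed score minus out-of-the-box score.
gap = {k: round(r["claimed"] - r["out_of_box"], 2) for k, r in results.items()}
# Fine-tuning gain: adapted score minus out-of-the-box score.
gain = {k: round(r["fine_tuned"] - r["out_of_box"], 2) for k, r in results.items()}

# Primary selection criterion: severe (Class 3) recall.
selected = max(results, key=lambda k: results[k]["severe_recall"])

print("gap:     ", gap)       # {'A': 0.18, 'B': 0.11, 'C': 0.13}
print("gain:    ", gain)      # {'A': 0.07, 'B': 0.13, 'C': 0.05}
print("selected:", selected)  # B
```

Ranking on macro-F1 after fine-tuning would also select Contributor B here, but the two criteria disagree on second place: Contributor C leads A on macro-F1 and trails it on severe recall.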

Business Impact

Illustrative assumptions: Phase III trial with 800 patients on a hepatotoxic compound / Severe DILI incidence estimated at 4.7% of the exposed population (about 38 expected severe cases) / Cost of one undetected severe DILI event (regulatory delay, programme review, patient harm exposure): estimated €500K–€2M per episode / Regulatory submission delay from insufficient generalisation evidence: 12–18 months

Strategy	Severe DILI Recall	Estimated Missed Cases (800 patients)	Regulatory Validation Evidence	Submission Risk
Unvalidated internal model	Unknown	Unknown	Single cohort only	High
Contributor A	0.51	~18 missed	Partial — moderate generalisation gap	Moderate
Contributor B ✅	0.73	~10 missed	Strong — cross-cohort evidence	Low
Contributor C	0.44	~21 missed	Partial — high severe miss rate	High
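The missed-case column appears to follow from treating 4.7% as the share of exposed patients who develop severe DILI and multiplying the expected severe cases by each model's miss rate (1 minus recall). A short sketch under that reading, with illustrative variable names:

```python
# Illustrative assumptions from the text: 800 exposed patients, 4.7% severe DILI.
patients = 800
severe_rate = 0.047
expected_severe = patients * severe_rate  # about 37.6 cases

# Severe DILI recall per contributor, from the results table.
severe_recall = {"A": 0.51, "B": 0.73, "C": 0.44}

# Expected missed severe cases = expected severe cases x miss rate.
missed = {k: round(expected_severe * (1 - r)) for k, r in severe_recall.items()}

print(f"expected severe cases = {expected_severe:.1f}")  # 37.6
print("missed:", missed)  # {'A': 18, 'B': 10, 'C': 21}
```

These are expected values under the stated assumptions, not observed counts. They scale linearly with trial size and assumed incidence.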

The regulatory value of this evaluation is not primarily financial — it is evidentiary. A generalisation result documented on an independent patient cohort, achieved through a validated federated evaluation methodology, constitutes the kind of external validation evidence that can accompany a safety biomarker submission to EMA or FDA. Contributor B provides that evidence. Running the same evaluation through a conventional data transfer would have required six to twelve months of governance, patient consent renegotiation, and legal review — and the data would have left the institution.

Decision

Marcus selects Contributor B's model for the safety submission package, with the independent cohort fine-tuning result constituting the external validation evidence required for regulatory documentation. The feature importance output from the fine-tuning run confirms that cumulative dose, ALT at peak exposure, and bile acid elevation are the dominant predictors of severe injury in the independent population — consistent with the pharmacological mechanism, which strengthens the submission narrative.

The tracebloc workspace stays active after the initial evaluation. As the institution adds patients to its retrospective cohort, or as new therapeutic programmes require DILI model validation, the same infrastructure can be reused. New model versions from Contributor B — or candidate models from other partners — can be evaluated on the same holdout set without rebuilding the evaluation pipeline. The leaderboard records which models hold up on independent data and which do not, turning a one-off regulatory requirement into ongoing pharmacovigilance model governance.

Explore this use case further:

View the model leaderboard — full contributor rankings, per-class recall, calibration plots
Explore the dataset — DILI class distribution, metabolomic feature statistics, dose-response patterns
Start training — submit your own DILI prediction model to this evaluation

Related use cases: See how the same generalisation validation approach applies to liquid biopsy classification in paediatric rare disease and omics biomarker panel narrowing across rare disease cohorts. For a broader view of what federated learning applications look like in preclinical toxicology and pharmacovigilance, see our guide.

Deploy your workspace or schedule a call.

Disclaimer

The dataset used in this use case is augmented: it was constructed to reflect the statistical structure of real-world hepatotoxicity safety cohorts, including DILI severity class distribution, clinical liver function marker distributions, metabolomic feature patterns, and dose-response relationships, without containing any identifiable patient information. The persona, contributor names, claimed performance figures, business impact assumptions, and regulatory scenario are illustrative and based on patterns observed across pharmaceutical safety and clinical data science environments. They do not represent any specific company, drug programme, institution, or regulatory submission.