
Safety Metabolomics: External Validation of Drug-Induced Liver Injury Biomarkers
Participants
7
End Date
01.04.26
Dataset
dvsfxok7
Resources
2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute
0 / 100.00 PF
Submits
0/5

Metabolomic Biomarker Validation for Hepatotoxicity Severity Classification
Drug-induced liver injury (DILI) is one of the leading causes of drug withdrawal and clinical trial failure. Pharma and biotech companies routinely develop internal safety biomarker models trained on their own trial data, but these models are rarely validated on independent external cohorts. Without external validation, there is no evidence that the biomarker panel or model generalizes beyond the original trial population, and regulatory agencies increasingly expect this evidence.
tracebloc provides secure access to an independent metabolomics safety dataset held at a clinical institution, enabling researchers to re-run their internally trained models on external data without the data leaving the hospital. This allows two critical validation steps: confirming whether previously selected biomarker features remain relevant in an independent cohort, and assessing whether model performance is consistent with what was observed on internal data.
To be completed after evaluation concludes.
SCIVIAS: Seeing Childhood Illness through Multi Omics
SCIVIAS is a monocentric observational study conducted at the Dr. von Hauner Children’s Hospital, LMU Munich, led by Prof. Dr. Dr. Christoph Klein. The study combines retinal imaging (fundus photography, OCT) with multi-omics profiling (genome, transcriptome, proteome, metabolome) to identify early diagnostic markers for rare and chronic childhood diseases.
The core premise: children with rare diseases are often diagnosed only when complications arise. SCIVIAS aims to change this by integrating pattern recognition on retinal images with multi-layer omics data, using machine learning to detect disease signatures before clinical manifestation. All omics data and retinal images are pseudonymized and processed through ML algorithms, comparing data both within defined disease groups and across phenotypes to uncover pleiotropic factors.
The cohort consists of 2500 patients and covers 13 therapeutic areas including IBD (Crohn’s, ulcerative colitis, celiac disease), cystic fibrosis, Duchenne muscular dystrophy, spinal muscular atrophy, and other rare pediatric conditions.
Ethics approval: LMU Munich, approval no. 17–801. German Clinical Trials Register: DRKS00013306.
Study page: https://www.ccrc-hauner.de/clinical-research/scivias-study
For this challenge, the metabolomics and clinical safety monitoring layer of the SCIVIAS cohort provides the foundation. The dataset captures drug exposure, standard liver function markers, cardiac safety markers, and metabolomic profiles (lipids, bile acids, and related metabolites) alongside a severity-graded safety outcome label. This combination of clinical chemistry and metabolomics makes the dataset well suited for validating DILI biomarker models that go beyond conventional liver enzyme panels.
Hepatotoxicity remains one of the most common reasons for drug attrition. Standard liver function tests (ALT, AST, ALP, bilirubin) are well established but have known limitations: they are late indicators, lack specificity for mechanism, and do not reliably distinguish mild transient elevations from signals of serious injury. Pharma and biotech companies have responded by developing metabolomic safety biomarker panels that combine traditional liver chemistry with lipid profiles, bile acid measurements, and drug exposure variables to improve early detection and severity grading of DILI.
The problem is validation. A biomarker panel developed on one company’s internal trial data may perform well in that specific population, drug, and dosing regimen, but there is no guarantee it generalizes. Regulatory submissions increasingly require evidence of external validity: does the panel work on an independent cohort, with different patient characteristics, different drug exposures, and different metabolomic measurement conditions? Without access to external safety datasets, this question goes unanswered.
This challenge is designed for a specific scenario: a pharma company has already trained a DILI severity classification model on its own clinical trial data. The model uses a combination of liver enzymes, metabolomic features, and drug exposure variables to classify patients into severity grades. The company now needs to answer two questions on an independent external dataset:
1. Feature relevance: Are the biomarker features that were selected as predictive on internal data still informative in the external cohort? If the model relied on specific lipid species or bile acid ratios, do those same features carry signal in a different patient population?
2. Performance consistency: Does the model achieve comparable classification performance (as measured by log loss) on external data? A significant performance drop indicates overfitting to the original cohort or population-specific effects that limit generalizability.
This is not exploratory biomarker discovery. It is a validation exercise where the model and feature set already exist, and the question is whether they hold up externally.
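One common way to probe the feature relevance question is permutation importance: shuffle a single feature column in the external data and measure how much log loss increases. The sketch below is a minimal, pure-Python illustration under stated assumptions, not tracebloc's evaluation code. The names `predict_proba` and `toy_predict_proba`, the two-class toy model, and the synthetic data are all hypothetical stand-ins, and on the platform such a check would have to run inside the submitted model code, since only aggregate metrics are returned.

```python
import math
import random

def log_loss(y_true, probs, eps=1e-15):
    """Mean log loss; only the probability of the true class is used."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total += -math.log(min(max(p[label], eps), 1.0 - eps))
    return total / len(y_true)

def permutation_importance(predict_proba, X, y, col, n_repeats=5, seed=0):
    """Average increase in log loss after shuffling column `col`."""
    rng = random.Random(seed)
    base = log_loss(y, [predict_proba(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original.
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        perm_loss = log_loss(y, [predict_proba(row) for row in X_perm])
        increases.append(perm_loss - base)
    return sum(increases) / n_repeats

# Hypothetical toy model: uses feature 0 only and ignores feature 1.
def toy_predict_proba(row):
    p1 = 1.0 / (1.0 + math.exp(-3.0 * row[0]))
    return [1.0 - p1, p1]

# Synthetic data (not the challenge dataset): the label follows the sign of feature 0.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]

imp_used = permutation_importance(toy_predict_proba, X, y, col=0)
imp_ignored = permutation_importance(toy_predict_proba, X, y, col=1)
print(imp_used, imp_ignored)
```

A feature whose shuffling leaves log loss unchanged carries no signal for the model on this cohort; a clear increase suggests the feature is still informative externally.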
Researchers work with a safety metabolomics dataset (1600 samples, 203 features) that combines four layers of information:
Drug exposure variables: dose, cumulative dose, days on drug, and weeks on treatment capture the pharmacokinetic context of each patient’s safety profile.
Standard clinical chemistry: conventional liver function markers (ALT, AST, alkaline phosphatase, GGT, bilirubin) alongside cardiac safety markers (troponin, creatine kinase, LDH) and patient demographics (age, weight, diabetes status).
Metabolomic profiles: lipid species, bile acid levels, and additional metabolomic measurements that extend beyond what standard liver panels capture. These metabolomic features are where novel biomarker signal is most likely to differentiate from conventional approaches.
The classification target is a four-class severity label. Feature names in the metabolomics layers are anonymized to protect the underlying clinical data structure.
External validation of safety biomarker models requires access to an independent dataset with comparable features: liver chemistry, metabolomics, drug exposure, and severity grading. These datasets are rare because they sit inside hospitals and clinical trials, protected by strict data governance. Transferring patient-level safety data externally is a regulatory and compliance challenge, particularly when metabolomic profiles in combination with drug exposure and demographics create re-identification risk. tracebloc resolves this: researchers submit their pre-trained model, it runs on the external dataset inside the hospital’s infrastructure, and only aggregate validation metrics are returned. The data never moves.
Four-class classification: predict the severity grade of drug-induced liver injury from a combination of drug exposure, clinical chemistry, and metabolomic features. The task is framed as external validation: researchers bring a model (or model architecture) developed on their own internal data and evaluate whether it generalizes to this independent cohort. Researchers may also train new models directly on this dataset to benchmark against their internal results.
Logarithmic Loss (cross-entropy loss). Lower is better. Log loss is the appropriate metric for this task because it evaluates both classification accuracy and probability calibration across all four severity classes. In a safety context, well-calibrated probability estimates matter: a model that assigns 85% probability to "no injury" when the true state is severe hepatotoxicity leaves at most 15% for the true class and is penalized heavily for that patient. Log loss depends only on the probability assigned to the true class, so it does not by itself weight underestimation more than overestimation. What it does capture is that confident wrong predictions are costly: on the rare severe cases, where underestimating severity is clinically most dangerous, each such error adds a large penalty to the score.
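The metric described above can be sketched in a few lines. This is a minimal illustration of multiclass log loss, assuming integer labels 0 to 3 and one probability vector per sample; the function name, the clipping constant `eps`, and the example probabilities are assumptions for illustration, and the platform's exact implementation may differ.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean multiclass log loss.

    y_true: list of integer class labels (0..K-1).
    probs:  list of per-sample probability lists, one entry per class.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # Only the probability assigned to the true class enters the loss.
        p_true = min(max(p[label], eps), 1.0 - eps)
        total += -math.log(p_true)
    return total / len(y_true)

# Hypothetical predictions for one patient whose true class is 3 (most severe).
confident_wrong = [0.85, 0.10, 0.04, 0.01]  # 85% on "no injury"
hedged = [0.40, 0.20, 0.15, 0.25]           # less confident, more mass on class 3

print(log_loss([3], [confident_wrong]))  # -ln(0.01), about 4.605
print(log_loss([3], [hedged]))           # -ln(0.25), about 1.386
```

The confident wrong prediction costs more than three times as much as the hedged one, which is the behavior the metric is chosen for.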
203 features across 1600 samples.
| Feature Block | Count | Notes |
|---|---|---|
| Drug exposure | 4 | Dose, cumulative dose, days on drug, weeks on treatment. |
| Patient demographics | 3 | Age, weight, diabetes status. |
| Liver function markers | 5 | ALT, AST, alkaline phosphatase, GGT, bilirubin. |
| Cardiac safety markers | 3 | Troponin, creatine kinase, LDH. |
| Lipid metabolomics | ~60 | Continuous. Lipid species measurements. |
| Bile acid metabolomics | ~15 | Continuous. Bile acid levels. |
| Other metabolomics | ~10 | Continuous. Additional metabolomic measurements. |
Heavily imbalanced. The majority of patients fall into the lowest severity grade (approximately 90%), with progressively fewer patients in the higher severity grades. The most severe class represents roughly 1% of the dataset. This imbalance is realistic for clinical safety data, where serious adverse events are rare, and it is the central modeling challenge: the rare severe cases are the ones that matter most clinically, but they are hardest to detect statistically.
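To make the imbalance concrete, it helps to know what log loss a model achieves by ignoring all features. The sketch below computes two naive baselines: predicting the class shares for every patient, and predicting a uniform distribution. The shares of about 90% and 1% come from the description above; the two middle values (6% and 3%) are placeholders assumed here for illustration, not figures from the dataset.

```python
import math

# Assumed class shares for grades 0..3. Only the first (~90%) and last (~1%)
# are stated in the text; the middle two are hypothetical placeholders.
priors = [0.90, 0.06, 0.03, 0.01]
n_samples = 1600

def prior_baseline_log_loss(shares):
    """Expected log loss when every sample is given the class shares as its
    prediction. This equals the Shannon entropy of the shares (in nats)."""
    return -sum(p * math.log(p) for p in shares if p > 0)

def uniform_baseline_log_loss(n_classes):
    """Log loss when every class gets probability 1 / n_classes."""
    return math.log(n_classes)

prior_baseline = prior_baseline_log_loss(priors)
uniform_baseline = uniform_baseline_log_loss(len(priors))
severe_count = n_samples * priors[-1]

print(prior_baseline)    # about 0.415 under the assumed shares
print(uniform_baseline)  # ln(4), about 1.386
print(severe_count)      # roughly 16 patients in the most severe class
```

Under these assumed shares, a model only demonstrates real signal if its log loss falls clearly below the prior baseline, and with roughly 16 severe cases any estimate for the top grade rests on very few patients.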
External validation of safety biomarker models requires a controlled environment where the same evaluation metric, the same data, and the same compute constraints are applied to every model. Without this, performance claims from internal validation are unverifiable. tracebloc provides the infrastructure for this: researchers submit models developed on their own data, those models execute on the external dataset inside the hospital, and log loss is computed under standardized conditions. This produces the kind of reproducible, auditable external validation evidence that regulatory submissions require.
tracebloc provides secure access to clinical safety data held at hospitals. Researchers interact through a controlled environment where they receive exploratory data analysis outputs to understand the external dataset, then submit model code that executes on the institution’s infrastructure. Raw patient data never leaves the hospital. Model weights are not extractable. Only aggregate performance metrics are returned.
Primary: log loss on the four-class severity label. For external validation, the key comparison is between the log loss achieved on internal data and the log loss achieved here. A consistent score indicates generalizability. A significant degradation indicates population-specific effects or overfitting.
Compute efficiency within the allocated budget. Pre-trained models being validated should require minimal compute. Researchers training new models from scratch on this dataset face the standard resource allocation trade-off.
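Whether a gap between internal and external log loss counts as a real degradation depends on the sampling uncertainty of the external score. One possible approach, sketched below, is a percentile bootstrap over per-sample losses. Everything here is an assumption for illustration: the internal reference value, the synthetic per-sample losses, and the function name are hypothetical, and the sketch presumes per-sample losses are available where the computation runs (for example inside the submitted code), since the platform returns only aggregate metrics.

```python
import random

def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample n values with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int(round((alpha / 2) * n_boot))
    hi_idx = int(round((1 - alpha / 2) * n_boot)) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical internal reference score (not a real result).
internal_log_loss = 0.30

# Synthetic per-sample external losses with mean near 0.35 (not real data).
loss_rng = random.Random(1)
external_losses = [loss_rng.expovariate(1 / 0.35) for _ in range(1600)]

external_mean = sum(external_losses) / len(external_losses)
ci_low, ci_high = bootstrap_mean_ci(external_losses)

# Flag degradation only if the internal score lies below the whole interval.
degraded = internal_log_loss < ci_low
print(external_mean, (ci_low, ci_high), degraded)
```

If the internal score falls inside the interval, the difference is compatible with sampling noise; if it falls below the interval, the degradation is unlikely to be explained by chance alone.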
The core validation questions: Do the metabolomic features that were predictive on internal data remain informative in this independent cohort? Does the model maintain its ability to detect the rare severe cases (classes 2 and 3) despite the heavy class imbalance? And critically, does the metabolomic layer add value over standard liver chemistry alone? A model that validates well on the external cohort with its full feature set, but loses performance when restricted to conventional markers, provides direct evidence for the clinical utility of metabolomic safety biomarkers.
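Because the overall log loss is dominated by the majority class in sheer sample count, the questions above are easier to answer with a per-class breakdown. The sketch below groups log loss by true class and compares two model variants. The labels, the probability vectors, and the variant names (`probs_full` for the full feature set, `probs_conventional` for conventional markers only) are hypothetical values chosen for illustration, not results from the challenge.

```python
import math
from collections import defaultdict

def per_class_log_loss(y_true, probs, eps=1e-15):
    """Mean log loss computed separately for each true class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label], eps), 1.0 - eps)
        sums[label] += -math.log(p_true)
        counts[label] += 1
    return {c: sums[c] / counts[c] for c in sorted(counts)}

# Hypothetical toy predictions: three grade-0 patients, one grade 2, one grade 3.
y_true = [0, 0, 0, 2, 3]
probs_full = [
    [0.90, 0.05, 0.03, 0.02],
    [0.80, 0.10, 0.05, 0.05],
    [0.95, 0.03, 0.01, 0.01],
    [0.30, 0.20, 0.40, 0.10],
    [0.20, 0.10, 0.20, 0.50],
]
probs_conventional = [
    [0.90, 0.05, 0.03, 0.02],
    [0.80, 0.10, 0.05, 0.05],
    [0.95, 0.03, 0.01, 0.01],
    [0.60, 0.20, 0.15, 0.05],
    [0.70, 0.10, 0.10, 0.10],
]

full = per_class_log_loss(y_true, probs_full)
conventional = per_class_log_loss(y_true, probs_conventional)

# Positive values mean the conventional-only variant is worse for that class.
delta = {c: conventional[c] - full[c] for c in full}
print(full)
print(conventional)
print(delta)
```

In this toy example the two variants are identical on grade 0 and differ only on the severe grades, which is the pattern that would support added value from the metabolomic layer.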
To be completed after evaluation concludes.