
Predictive Patient Stratification in Pediatric Bleeding Disorders
Participants
8
End Date
01.04.26
Dataset
d1133xgc
Resources
2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute
0 / 100.00 PF
Submits
0/5

Bleeding disorders in children, including hemophilia A, hemophilia B, von Willebrand disease, and thrombophilia, are clinically heterogeneous. Patients with the same diagnosis can respond very differently to the same therapy. Gene therapy developers and rare-disease-focused biotechs need to stratify pediatric patients into clinically meaningful subgroups before trial enrollment, but the multi omics data required for this is not commercially available.
A secure evaluation platform is required to benchmark patient stratification models on real pediatric multi omics data without the data leaving the clinical institution. tracebloc enables participating teams to submit classification models that integrate genomic, transcriptomic, proteomic, and clinical features, evaluating which architectures and integration strategies produce the most reliable stratification in a controlled, reproducible environment.
To be completed after evaluation concludes.
SCIVIAS: Seeing Childhood Illness through Multi Omics
SCIVIAS is a monocentric observational study conducted at the Dr. von Hauner Children’s Hospital, LMU Munich, led by Prof. Dr. Dr. Christoph Klein. The study combines retinal imaging (fundus photography, OCT) with multi omics profiling (genome, transcriptome, proteome, metabolome) to identify early diagnostic markers for rare and chronic childhood diseases.
The core premise: children with rare diseases are often diagnosed only when complications arise. SCIVIAS aims to change this by integrating pattern recognition on retinal images with multi-layer omics data, using machine learning to detect disease signatures before clinical manifestation. All omics data and retinal images are pseudonymized and processed through ML algorithms, comparing data both within defined disease groups and across phenotypes to uncover pleiotropic factors.
The cohort consists of 2,500 patients and covers 13 therapeutic areas, including inflammatory bowel disease (Crohn’s disease, ulcerative colitis), celiac disease, cystic fibrosis, Duchenne muscular dystrophy, spinal muscular atrophy, and other rare pediatric conditions.
Ethics approval: LMU Munich, approval no. 17–801. German Clinical Trials Register: DRKS00013306.
Study page: https://www.ccrc-hauner.de/clinical-research/scivias-study
For this challenge, the hemostaseology subset of SCIVIAS is the focus: pediatric patients with hemophilia A, hemophilia B, von Willebrand disease, and thrombophilia. A substantial proportion of patients are genetically confirmed, providing high diagnostic certainty. The cohort is longitudinal (baseline plus 2-year post-therapy follow-up), and all five omics layers are available. Feature names in the challenge dataset are anonymized to protect the underlying clinical data structure.
Pediatric bleeding disorders span a wide clinical spectrum. Two children with hemophilia A can have fundamentally different molecular profiles, different comorbidity patterns, and different responses to factor replacement or gene therapy. Current clinical practice stratifies patients primarily by factor activity levels and bleeding phenotype, but these clinical markers alone are poor predictors of long-term treatment response, particularly for gene therapy durability and inhibitor development risk.
The question pharma and biotech companies face: can multi omics data stratify patients into subgroups with distinct treatment trajectories? If so, the answer directly informs trial design (enrichment strategies, endpoint selection), companion diagnostic development, and eventually clinical decision-making. The challenge is that the data required to answer this question (pediatric multi omics with longitudinal treatment follow-up across multiple bleeding disorder subtypes) does not exist in any commercially accessible form.
Single omics approaches capture one layer of biology. Genomic mutations define the genetic basis of disease but say nothing about how the mutation manifests at the protein or metabolic level. Transcriptomics captures gene activity but misses post-translational regulation. Proteomics sees the functional molecules but not their genetic drivers. Metabolomics reflects downstream pathway activity but not upstream causes.
Effective patient stratification in heterogeneous diseases like bleeding disorders requires integrating across these layers. A patient subgroup defined by the convergence of a specific mutation pattern, a transcriptomic signature, and a proteomic profile is biologically more meaningful (and clinically more actionable) than one defined by any single layer alone. Models that can identify these multi-layer subgroups produce stratifications that are more likely to replicate in independent cohorts and more likely to predict treatment response.
Features that consistently drive stratification across multiple omics layers also become candidates for downstream biomarker validation and, where biological plausibility supports it, target discovery. The stratification task is the computational foundation on which these higher-order questions are built.
Participants work with a multi omics dataset (480 samples, 282 features) derived from the SCIVIAS hemostaseology cohort. The dataset contains four feature blocks representing genomic mutations, gene expression levels, protein measurements, and clinical phenotype variables. Feature names are anonymized (mutation_0, gene_0, protein_0, clinical_0, etc.) to protect the underlying clinical data structure. The four feature blocks correspond to real molecular and clinical measurements from the original cohort.
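Because the block membership of each feature is encoded in its anonymized name, the columns can be grouped by prefix before any block-wise modeling. The following is a minimal sketch under stated assumptions: the four prefixes are taken from the naming scheme described above, while the helper `group_columns_by_block`, the example column list, and the target column name `label` are hypothetical and used for illustration only.

```python
from collections import defaultdict

# Block prefixes as described for the anonymized feature names.
BLOCK_PREFIXES = ("mutation", "gene", "protein", "clinical")


def group_columns_by_block(columns):
    """Map each known block prefix to the column names that carry it.

    Columns that do not follow the '<prefix>_<index>' pattern (for example
    a target column) are ignored.
    """
    blocks = defaultdict(list)
    for name in columns:
        prefix, _, index = name.rpartition("_")
        if prefix in BLOCK_PREFIXES and index.isdigit():
            blocks[prefix].append(name)
    return dict(blocks)


# Hypothetical column list; the real dataset has 282 features.
example_columns = [
    "mutation_0", "mutation_1", "gene_0", "protein_0", "clinical_0", "label",
]
blocks = group_columns_by_block(example_columns)
print({block: len(names) for block, names in blocks.items()})
```

Keeping the blocks as separate column lists makes it straightforward to train single-block baselines and compare them against a model that sees all four blocks.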
The classification target is a binary label. The dataset is complete with no missing values.
Patient stratification using multi omics data requires both the data and the analytical infrastructure to be in the same place. Pharma and biotech companies have the ML capacity but lack pediatric multi omics cohorts. Academic medical centers have the cohorts but lack systematic model benchmarking infrastructure. Transferring de-identified multi omics data externally creates re-identification risk, especially in rare disease populations where patient profiles are inherently distinctive. tracebloc resolves this: models travel to the data, the data stays at the institution, and evaluation happens inside a standardized, auditable environment.
Binary classification on integrated multi omics features. The model receives genomic mutation profiles, gene expression levels, protein measurements, and clinical variables, and must predict the binary outcome label. The challenge tests two capabilities simultaneously: raw predictive performance (can you classify accurately?) and multi omics integration (does combining layers outperform single-layer models?).
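Mean Squared Error (MSE). Lower is better. When the model outputs a probability for the positive class, MSE on a binary target is the Brier score: it penalizes confident wrong predictions quadratically, so a confidently misplaced patient costs far more than a hesitant one. Unlike log loss, which grows without bound as a wrong prediction approaches certainty, the per-sample MSE penalty is bounded at 1, which keeps the ranking from being dominated by a handful of extreme errors. This suits a clinical stratification task where confidently placing a patient in the wrong subgroup has real consequences.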
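To make the behavior of the metric concrete, the sketch below compares the per-sample squared error with the per-sample log loss for increasingly confident wrong predictions. It assumes the model outputs a probability for the positive class, which the challenge description does not state explicitly; the function names are illustrative.

```python
import math


def squared_error(y_true, p):
    """Per-sample squared error between a 0/1 label and a predicted probability."""
    return (y_true - p) ** 2


def log_loss(y_true, p):
    """Per-sample binary cross-entropy; p must lie strictly between 0 and 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# True label is 0; p is the (wrong) predicted probability of class 1.
for p in (0.6, 0.9, 0.99, 0.999):
    print(f"p={p}: squared error={squared_error(0, p):.4f}, "
          f"log loss={log_loss(0, p):.4f}")
```

The squared error approaches but never exceeds 1, whereas the log loss keeps growing as the wrong prediction approaches certainty.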
For a detailed overview, see the EDA section.
282 features across 480 samples. PCA shows that 178 components are needed to capture 90% of the variance, indicating high effective dimensionality. With roughly 0.59 features per sample, the feature-to-sample ratio is manageable, but regularization remains important.
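The "components needed for 90% of variance" figure can be computed from the singular values of the centered feature matrix. The sketch below illustrates the calculation on synthetic Gaussian data with the same shape as the challenge dataset; it is not the challenge data and will not reproduce the 178 quoted above. Centering without scaling is an assumption here, since the preprocessing used in the EDA is not specified.

```python
import numpy as np


def components_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative ratio reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in with the same shape as the challenge dataset.
rng = np.random.default_rng(0)
X_synthetic = rng.normal(size=(480, 282))
k = components_for_variance(X_synthetic)
print(f"{k} of {X_synthetic.shape[1]} components reach 90% of variance")
```

A count close to the number of features, as in the reported EDA, suggests that variance is spread across many directions and that aggressive dimensionality reduction may discard signal.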
| Feature Block | Count | Notes |
|---|---|---|
| Mutations | 80 | Binary (0/1). Genomic variant profiles. |
| Gene expression | ~110 | Continuous. Transcriptomic features. |
| Protein levels | ~50 | Continuous. Proteomic measurements. |
| Clinical phenotype | 20 | Continuous. Clinical measurements. |
The two classes are approximately balanced, splitting roughly 50/50.
Multi omics integration is methodologically open: early fusion, late fusion, stacking, attention-based architectures, and tree ensembles all represent viable strategies with different trade-offs. Without a controlled benchmarking environment, it is impossible to determine whether one approach genuinely outperforms another or whether differences reflect data handling, compute allocation, or random variation. tracebloc standardizes the evaluation surface so that performance differences are attributable to modeling choices alone.
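As an illustration of two of the strategies named above, the sketch below contrasts early fusion (concatenate the blocks, fit one model) with late fusion (fit one model per block, average the predicted probabilities). Everything in it is an assumption made for illustration: the data are synthetic, the two blocks are stand-ins for omics layers, and the small gradient-descent logistic regression is a placeholder for whatever model a team would actually submit.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, steps=500, l2=1e-2):
    """Gradient-descent logistic regression with an L2 penalty (illustrative)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def predict_proba(model, X):
    w, b = model
    return sigmoid(X @ w + b)


def mse(y_true, p):
    return float(np.mean((y_true - p) ** 2))


# Synthetic data: two feature blocks that each carry some class signal.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n).astype(float)
block_a = rng.normal(size=(n, 5)) + 0.8 * y[:, None]
block_b = rng.normal(size=(n, 3)) + 0.5 * y[:, None]
train, test = slice(0, 150), slice(150, n)

# Early fusion: one model on the concatenated blocks.
X_early = np.hstack([block_a, block_b])
p_early = predict_proba(fit_logistic(X_early[train], y[train]), X_early[test])

# Late fusion: one model per block, probabilities averaged.
per_block = [
    predict_proba(fit_logistic(B[train], y[train]), B[test])
    for B in (block_a, block_b)
]
p_late = np.mean(per_block, axis=0)

mse_early, mse_late = mse(y[test], p_early), mse(y[test], p_late)
print(f"early fusion MSE={mse_early:.3f}, late fusion MSE={mse_late:.3f}")
```

Which strategy scores better depends on the data and the models; the point of the sketch is only that both produce predictions that can be compared under the same metric and the same split.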
tracebloc operates as a secure AI evaluation platform. Participating teams interact through a controlled API. They receive exploratory data analysis outputs to understand the dataset, then submit model code that executes inside the tracebloc infrastructure. Raw patient data never leaves the secure environment. Model weights are not extractable. Only aggregate performance metrics are returned.
Primary: MSE on the binary classification target. Models are ranked on the leaderboard by this metric.
Secondary: compute efficiency within the 100 PF budget. Multi omics integration models can be computationally expensive, particularly deep learning approaches. Resource-aware architecture selection is part of the challenge.
Beyond raw MSE, the scientifically interesting question is whether models that integrate across all four feature blocks outperform those that rely on a single layer. If multi omics integration consistently improves stratification, this validates the premise that cross-layer signal matters for bleeding disorder subtyping, and the features driving that improvement become candidates for downstream biomarker validation and target discovery work.
To be completed after evaluation concludes.
To be completed after evaluation concludes.