
Liquid Biopsy Proteomics: Non-Invasive Classification in Pediatric Rare Disease
Participants
8
End Date
01.04.26
Dataset
d4onaj8u
Resources
2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute
0 / 100.00 PF
Submits
0/5

Blood-Based Proteomic Screening as an Alternative to Invasive Diagnostics
Diagnosing rare diseases in children often requires invasive procedures: tissue biopsies, bone marrow aspirates, lumbar punctures, or endoscopies. These are painful, carry procedural risk, require sedation or general anesthesia in young children, and are difficult to repeat for monitoring. If circulating protein signatures in blood could classify disease state with sufficient accuracy, many of these invasive procedures could be deferred or reserved for confirmation only, reducing burden on patients and families while accelerating the diagnostic pathway.
tracebloc provides secure access to a pediatric liquid biopsy proteomics dataset held at a clinical institution, enabling researchers to build and validate blood-based classification models without the data leaving the hospital. The dataset combines standard blood protein measurements with specialized circulating disease markers and clinical variables, providing the feature space needed to evaluate whether non-invasive proteomic profiling can substitute for invasive diagnostic procedures.
To be completed after evaluation concludes.
SCIVIAS: Seeing Childhood Illness through Multi-Omics
SCIVIAS is a monocentric observational study conducted at the Dr. von Hauner Children’s Hospital, LMU Munich, led by Prof. Dr. Dr. Christoph Klein. The study combines retinal imaging (fundus photography, OCT) with multi-omics profiling (genome, transcriptome, proteome, metabolome) to identify early diagnostic markers for rare and chronic childhood diseases.
The core premise: children with rare diseases are often diagnosed only when complications arise. SCIVIAS aims to change this by integrating pattern recognition on retinal images with multi-layer omics data, using machine learning to detect disease signatures before clinical manifestation. All omics data and retinal images are pseudonymized and processed through ML algorithms, comparing data both within defined disease groups and across phenotypes to uncover pleiotropic factors.
The cohort consists of 2,500 patients and covers 13 therapeutic areas, including inflammatory bowel disease (Crohn’s disease, ulcerative colitis), celiac disease, cystic fibrosis, Duchenne muscular dystrophy, spinal muscular atrophy, and other rare pediatric conditions.
Ethics approval: LMU Munich, approval no. 17–801. German Clinical Trials Register: DRKS00013306.
Study page: https://www.ccrc-hauner.de/clinical-research/scivias-study
For this challenge, the proteomic layer of the SCIVIAS cohort provides the foundation. The dataset captures blood-based protein profiles from pediatric patients across multiple rare disease groups, combining bulk circulating proteins with specialized disease markers measured from standard blood draws. This non-invasive sampling approach is central to the study’s goal of identifying diagnostic signatures that can be detected before clinical symptoms manifest or without requiring tissue sampling.
The diagnostic journey for children with rare diseases is notoriously long: five years on average from first symptoms to confirmed diagnosis. A major reason is that definitive diagnosis often requires invasive tissue sampling, which clinicians are reluctant to perform in young children without strong clinical suspicion. This creates a catch-22: you need symptoms severe enough to justify the procedure, but by that point the disease has already progressed.
Blood-based diagnostics could break this cycle. A simple blood draw is routine, low-risk, repeatable, and acceptable to families. If proteomic signatures in blood can classify disease state with sufficient accuracy, invasive procedures could be reserved for confirmation rather than used as first-line diagnostics. For pharma and biotech companies developing therapies for pediatric rare diseases, this has direct implications: earlier diagnosis means earlier trial enrollment, larger eligible populations, and the possibility of treating before irreversible damage occurs.
Liquid biopsy refers to the analysis of biomarkers circulating in blood (or other body fluids) rather than obtained from tissue. In the context of rare disease, the relevant circulating analytes are proteins: disease-specific markers released into the bloodstream by affected tissues, inflammatory mediators, and pathway activity indicators that reflect underlying pathology without requiring direct tissue access.
The dataset in this challenge contains two distinct types of protein features. The first is a broad panel of circulating proteins measured from blood, representing the general proteomic landscape. The second is a set of specialized disease markers: circulating biomarkers that are elevated in specific disease states but remain at baseline in healthy individuals. These marker features have a characteristic statistical profile: most patients show low values, but a subset shows strongly elevated readings. This pattern is exactly what makes them diagnostically useful: the elevation is the signal.
The modeling challenge is to combine these two protein layers (general proteomic background and specialized markers) with clinical variables into a classifier that can identify disease state from a blood sample alone.
Researchers work with a liquid biopsy proteomics dataset (1200 samples, 153 features) derived from the SCIVIAS cohort. The dataset contains three feature blocks: bulk circulating protein levels, specialized disease markers, and clinical measurements. Feature names are anonymized. A patient identifier is included. Three categorical features are present, one of which is the classification target.
The specialized marker features behave differently from the bulk proteins: they show highly skewed distributions with heavy right tails, meaning most patients have near-baseline values while a small subset shows strongly elevated readings. This structure is characteristic of circulating disease markers and is core to the classification task.
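The heavy right tail described above can be checked and tamed with a few lines of NumPy. The following is a minimal sketch on synthetic data, not the challenge dataset: the log-normal stand-in, the `sample_skewness` helper, and the choice of `log1p` are all assumptions made for illustration, and other transforms (for example rank-based ones) may suit the real markers better.

```python
import numpy as np

def sample_skewness(x):
    """Skewness as the third central moment divided by the cubed std (ddof=0)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    std = centered.std()
    if std == 0:
        return 0.0  # a constant feature has no skew
    return float(np.mean(centered ** 3) / std ** 3)

# Synthetic stand-in for one marker feature (assumption, not real data):
# most values sit near baseline, a few are strongly elevated.
rng = np.random.default_rng(0)
marker = rng.lognormal(mean=0.0, sigma=1.5, size=1200)

# log1p compresses the right tail; it is monotone, so elevated samples
# remain the largest values after the transform.
marker_log = np.log1p(marker)

skew_raw = sample_skewness(marker)
skew_log = sample_skewness(marker_log)
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

Because the elevation itself is the signal, a monotone transform like this keeps the ranking of patients intact while reducing the influence of extreme values on scale-sensitive models.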
Developing blood-based diagnostic classifiers for rare diseases requires large, well-characterized pediatric cohorts with both proteomic profiling and confirmed diagnoses. These datasets sit inside academic medical centers, protected by strict data governance. Pharma and biotech companies developing companion diagnostics or screening tools need to validate their classifiers on independent data, but cannot access it through traditional data transfer agreements without significant compliance risk, particularly in pediatric rare disease populations where proteomic profiles combined with clinical variables create re-identification risk. tracebloc provides the bridge: researchers access the data securely, submit models that execute inside the hospital, and receive only aggregate performance metrics.
This challenge supports two distinct use cases, and researchers can pursue either or both:
Standalone discovery: Researchers who do not have an existing model can use this dataset to build a blood-based classifier from scratch. The combination of bulk circulating proteins, specialized disease markers, and clinical variables provides a rich feature space for training new models and identifying which protein features carry diagnostic signal in a pediatric rare disease population.
External validation: Researchers who have already developed a proteomic classifier on their own internal data can use this dataset as an independent validation cohort. The two key validation questions are the same as in any external validation exercise: do the features that were predictive on internal data remain informative in this independent cohort, and does the model achieve comparable performance? A classifier that validates well on this external pediatric dataset provides evidence of generalizability that is difficult to obtain through any other pathway, given the rarity of accessible pediatric proteomic cohorts with confirmed diagnoses.
Classification from blood-based proteomic features: predict disease state using a combination of circulating protein levels, specialized disease markers, and clinical variables. The task evaluates whether a non-invasive blood sample provides sufficient information to classify patients accurately enough to approach diagnostic-grade performance, potentially deferring invasive procedures to a confirmation role. Researchers must handle the mixed feature structure: normally distributed bulk proteins alongside highly skewed disease markers that carry signal in their tail behavior.
Mean Squared Error (MSE). Lower is better. MSE penalizes confident wrong predictions quadratically, making it sensitive to cases where the model is certain but incorrect. In a diagnostic screening context, this is appropriate: a classifier that confidently assigns "healthy" to a patient with elevated disease markers is more dangerous than one that expresses uncertainty.
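To make the quadratic penalty concrete, the sketch below computes MSE between one-hot encoded labels and predicted class probabilities, which is one common way to apply MSE to a classifier (often called the Brier score). This is an assumption for illustration: the challenge text does not specify how the platform encodes the target or averages the error, and the `multiclass_mse` helper is hypothetical.

```python
import numpy as np

def multiclass_mse(y_true, proba, n_classes):
    """MSE between one-hot labels and predicted probabilities.

    Averages over samples and classes; other conventions sum over classes.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    one_hot = np.eye(n_classes)[y_true]  # shape (n_samples, n_classes)
    return float(np.mean((proba - one_hot) ** 2))

# One toy patient whose true class is 1 (class order here is [healthy, disease]).
y = [1]
confident_wrong = multiclass_mse(y, [[0.95, 0.05]], n_classes=2)
uncertain = multiclass_mse(y, [[0.55, 0.45]], n_classes=2)
confident_right = multiclass_mse(y, [[0.05, 0.95]], n_classes=2)

print(confident_wrong, uncertain, confident_right)
```

In this toy case the confident wrong prediction scores roughly three times worse than the uncertain one, which is the behavior the metric description relies on.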
153 features across 1200 samples.
| Feature Block | Count | Notes |
|---|---|---|
| Circulating proteins | 100 | Continuous. Bulk blood protein levels representing the general proteomic landscape. |
| Specialized disease markers | 20 | Continuous. Circulating biomarkers with highly skewed distributions: baseline in most patients, strongly elevated in a subset. These carry the primary diagnostic signal. |
| Clinical measurements | ~30 | Continuous. Clinical phenotype variables. |
Additionally, the dataset includes a patient identifier and three categorical features, one of which is the classification target.
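Since a patient identifier is included, one cautious option is to split data by patient rather than by sample, so that no patient appears on both sides of a train/validation split. The sketch below assumes a patient may contribute more than one sample, which the challenge text does not state; the `grouped_split` helper and the toy identifiers are hypothetical.

```python
import numpy as np

def grouped_split(patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each patient entirely on one side."""
    patient_ids = np.asarray(patient_ids)
    unique_ids = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    # Hold out at least one patient, rounding the requested fraction.
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    test_mask = np.isin(patient_ids, shuffled[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy identifiers (assumption): some patients have repeated samples.
patient_ids = np.array(["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p5", "p5", "p6"])
train_idx, test_idx = grouped_split(patient_ids, test_fraction=0.33)

print("train patients:", sorted(set(patient_ids[train_idx])))
print("test patients:", sorted(set(patient_ids[test_idx])))
```

If every patient turns out to have exactly one sample, this reduces to an ordinary random split.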
Researchers should examine the target variable distribution as a first step in their exploratory analysis.
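A first look at the target distribution can be as simple as counting labels. The sketch below uses placeholder class names, because the real labels are not given in the challenge text; `class_distribution` is a hypothetical helper.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}

# Placeholder labels (assumption): the real class names are anonymized.
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
dist = class_distribution(labels)

for label, (n, frac) in dist.items():
    print(f"{label}: {n} samples ({frac:.0%})")
```

A strongly imbalanced result here would be a reason to look beyond a single aggregate score and to consider stratified splits.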
Non-invasive diagnostic classifiers must meet a high bar before they can replace established invasive procedures. Performance claims need to be validated on independent data under standardized conditions. tracebloc provides this: researchers build or validate classifiers on an external pediatric cohort, with standardized evaluation metrics, inside a secure environment. The resulting performance evidence is reproducible, auditable, and applicable to regulatory or clinical decision-making about whether a blood-based test is ready for deployment.
tracebloc provides secure access to clinical proteomic data held at hospitals. Researchers interact through a controlled environment where they receive exploratory data analysis outputs to understand the dataset, then submit model code that executes on the institution’s infrastructure. Raw patient data never leaves the hospital. Model weights are not extractable. Only aggregate performance metrics are returned.
Primary: MSE on the classification target. For a non-invasive screening application, both sensitivity (detecting disease when present) and specificity (avoiding false positives that trigger unnecessary invasive follow-up) matter. MSE captures the overall calibration of predicted probabilities across all classes.
Compute efficiency within the allocated budget. The dataset is moderately sized (1200 samples) with manageable dimensionality (153 features), so most standard architectures will fit within the compute constraints. The trade-off is between model complexity and the risk of overfitting on the specialized marker features, which have heavy-tailed distributions that can dominate gradient-based optimization.
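Sensitivity and specificity are not the scored metric, but they are easy to track locally alongside MSE. The sketch below treats the problem as binary (one class against the rest), which is an assumption; the helper name and the toy predictions are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Compute (sensitivity, specificity) for one class treated as positive."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # disease present and detected
            else:
                fn += 1  # disease present but missed
        else:
            if pred == positive:
                fp += 1  # false alarm, may trigger invasive follow-up
            else:
                tn += 1  # correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy example: 4 diseased and 6 healthy patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

For a multi-class target, the same function can be applied once per class by passing that class as `positive`.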
The central question: can a blood draw replace an invasive procedure? If a classifier built on circulating proteins and disease markers achieves performance comparable to tissue-based diagnostic standards, this validates the liquid biopsy approach for pediatric rare disease. The secondary question is feature attribution: which proteins and markers drive the classification? If a small panel of circulating biomarkers accounts for most of the predictive signal, this defines a focused assay that could be deployed as a point-of-care screening test, making non-invasive rare disease screening practical and scalable in clinical settings.
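One model-agnostic way to approach the feature attribution question is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below is a self-contained illustration on synthetic data with a deliberately simple nearest-centroid classifier; the data, the classifier, and all function names are assumptions, not the method used by the challenge. For brevity it scores on the training data, whereas a held-out set would give a more honest estimate.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    # Squared Euclidean distance of every sample to every centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def permutation_importance(X, y, classes, centroids, rng, n_repeats=10):
    """Mean accuracy drop when each feature column is shuffled."""
    baseline = np.mean(predict(X, classes, centroids) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature-label link
            drops[j] += baseline - np.mean(predict(Xp, classes, centroids) == y)
    return drops / n_repeats

# Synthetic data (assumption): 5 features, only feature 0 carries signal.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
X[:, 0] += 3.0 * y

classes, centroids = fit_centroids(X, y)
importance = permutation_importance(X, y, classes, centroids, rng)
top_feature = int(importance.argmax())
print("importance per feature:", np.round(importance, 3))
```

If only a handful of features show a meaningful drop, that is the kind of evidence that would point toward a small, focused marker panel.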
To be completed after evaluation concludes.