Liquid Biopsy Proteomics for Non-Invasive Paediatric Rare Disease

End Date

30.11.26

Dataset

d4onaj8u

Resources

2 CPU (8.59 GB) | 1 GPU (22.49 GB)

Compute

0 / 100.00 PF

Submits

0/5

Overview

About this use case: A paediatric hospital holds 1,200 rare disease liquid biopsy samples that took eight years to assemble — a dataset pharma and diagnostics companies would pay to train on, except that GDPR and re-identification risk make a raw data transfer impossible. tracebloc lets companies train classification models inside the hospital's infrastructure and generates recurring access-fee revenue without a single patient record ever leaving. Explore the data, submit your own model, and see how your approach compares.

Problem

The Universitäts-Kinderklinik Hartenberg holds the most comprehensive liquid biopsy dataset for paediatric rare disease in Europe: 1,200 blood-based proteomic samples spanning circulating tumour markers, broad protein panels, and clinical variables from children across multiple rare disease groups. Pharmaceutical companies developing non-invasive diagnostics need to train classification models on this data. Ethics approval, GDPR compliance, and re-identification risk mean they cannot access a single patient record directly.

Solution

Prof. Dr. Elisabeth Hartmann, Director of Clinical Research Data, deploys a tracebloc workspace loaded with 1,200 paediatric liquid biopsy samples. Pharmaceutical and diagnostics companies submit their classification models to the workspace. Inside tracebloc's containerised training environment, models train on the patient sample set — fine-tuning weights to the skewed marker distributions and protein expression patterns specific to paediatric rare disease — without the data ever leaving the hospital's infrastructure. tracebloc orchestrates training, scores each adapted model against the evaluation cohort, and publishes results to a live leaderboard automatically. This is a federated learning application of data monetisation: the institution controls the asset, the contributors train on it, and not one patient record moves. Each contributor pays a training access fee. The institution generates recurring research revenue from data it already owns.

Outcome

In this example evaluation, contributors range from diagnostics firms validating existing blood-based assay models against a paediatric rare disease cohort to pharma companies building novel liquid biopsy classifiers from scratch. The workspace surfaces which approaches handle the skewed marker distributions effectively and which collapse when rare-disease-elevated markers dominate predictions. The tracebloc workspace stays in place as the institution's liquid biopsy cohort grows, enabling continuous model improvement without renegotiating access.

The Operational Challenge

Prof. Hartmann's institution has been running the PEDIOLIX cohort study for eight years. The liquid biopsy layer of the cohort — blood-based proteomic samples from children across 13 therapeutic areas — is the result of a data collection programme that no pharma company could replicate independently. It required ethics approval, patient consent infrastructure for a paediatric population, multi-year longitudinal sampling, and a clinical team experienced in paediatric rare disease phenotyping. The data exists because a hospital built it. It is not a dataset any company can buy.

The diagnostic gap it addresses is substantial. For most paediatric rare diseases, the path to diagnosis still runs through invasive procedures: tissue biopsies, bone marrow aspirates, lumbar punctures. These are painful, carry procedural risk, and are difficult to justify in young children without strong clinical suspicion. The average rare disease diagnosis takes five years from first symptoms. In that window, children are treated empirically, disease progresses, and families carry a burden of uncertainty that compounds the clinical problem.

Blood-based biomarkers, proteins that circulate in plasma after release from tissue, could transform this pathway. If proteomics from a standard blood draw can classify disease state accurately enough, invasive procedures can be reserved for confirmation rather than used as first-line diagnostics. Several pharma and diagnostics companies are actively developing liquid biopsy classifiers for paediatric rare disease. Every one of them needs access to an external paediatric cohort to validate or train their model. None of them can replicate the PEDIOLIX sample base.

The commercial interest is real and growing. Companies are willing to pay for training access to this cohort. The institution's problem is not demand — it is compliance. A data transfer agreement covering 1,200 paediatric patients with rare disease diagnoses, proteomic profiles, and clinical variables would require months of legal review, GDPR impact assessments, and data protection board sign-off. And at the end of that process, the institution would have given away its most valuable asset. The dataset — once transferred — is no longer under the institution's control. It can be copied, redistributed, and used beyond any agreement's scope.

The institution's research ethics board and data protection officer have been clear: no raw transfer. But the institution also cannot afford to leave recurring commercial revenue on the table. Research budgets are under pressure. Clinical data assets that took eight years to build deserve to generate ongoing value.

Stakeholders

Prof. Dr. Elisabeth Hartmann, Director of Clinical Research Data: Owns the data governance strategy and the commercial licensing framework. KPIs: revenue per access period, data governance compliance, publication output from trained models. Needs a monetisation model that does not require raw data transfer.
Data Protection Officer: Responsible for GDPR compliance across all research data uses. The paediatric rare disease cohort carries elevated re-identification risk — disease phenotype combined with proteomic profile creates a quasi-unique signature in small populations. No raw transfer under any commercial agreement.
Research Ethics Board: Ethics approval covers the cohort's use for academic and translational research. Commercial use by external pharma companies requires explicit scope review. Training access through a controlled environment — where the institution maintains full data custody — is within scope; data transfer is not.
Head of Bioinformatics: Technical lead for the PEDIOLIX data pipeline. Owns the exploratory data analysis, feature engineering, and the tracebloc workspace configuration. Monitors contributor training runs for anomalous access patterns.
Pharma / Diagnostics Contributor (VP Clinical Diagnostics): Needs access to an independent paediatric liquid biopsy cohort to validate or extend their non-invasive classifier. Willing to pay for controlled training access. Evaluation data has to be genuinely independent — not a dataset curated to match their internal model's training distribution.

The Underlying Dataset

The evaluation dataset contains 1,200 paediatric liquid biopsy samples from children across multiple rare disease groups. Full dataset statistics, feature distributions, and marker behaviour are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to reflect the statistical structure of a real-world paediatric rare disease liquid biopsy cohort — the circulating marker distributions, the broad protein expression ranges, and the skew profiles characteristic of disease-state versus baseline biomarker behaviour — without containing any identifiable patient records, diagnoses, or hospital data.

Property	Value
Total samples	1,200
Features	153
Circulating proteins	~100 (continuous, bulk blood protein levels)
Specialised disease markers	~20 (continuous, highly skewed — baseline in most, elevated in disease-state subset)
Clinical variables	~30 (continuous, clinical phenotype measurements)
Categorical features	3 (including patient identifier and classification target)
Missing values	None
Zero inflation	None across all examined features
Evaluation metric	MSE

A note on the marker features: The specialised disease markers show highly skewed distributions with heavy right tails. Most patients show near-baseline values; a subset shows strongly elevated readings. This is not a data quality issue — it is the diagnostic signal. Disease-state elevation in these markers is what makes liquid biopsy diagnostically useful. Models that cannot handle right-skewed, sparse biomarker distributions will underperform relative to those that do.
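To make the skew concrete, the sketch below measures sample skewness on a synthetic marker before and after a log transform. Everything in it is an assumption for illustration: the lognormal parameters, the 10% elevated fraction, and the helper name `sample_skewness` are not taken from the dataset or from the tracebloc workspace. A log transform is also only one common way to treat heavy right tails, not a recommended pipeline.

```python
import math
import random

def sample_skewness(values):
    """Fisher-Pearson skewness: third central moment over the cubed std dev."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical marker: 90% of samples near baseline, 10% strongly elevated
baseline = [rng.lognormvariate(0.0, 0.3) for _ in range(1080)]
elevated = [rng.lognormvariate(3.0, 0.5) for _ in range(120)]
marker = baseline + elevated

raw_skew = sample_skewness(marker)
log_skew = sample_skewness([math.log1p(v) for v in marker])

print(f"raw skewness:   {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")
```

In this synthetic setting the transform shortens the tail but leaves the skewness clearly positive, because the elevated subset remains a separate group of samples. That fits the point above: the elevation is signal to be modelled, not an artefact to be removed.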

How Evaluation Works

Each contributor submitted their non-invasive classification model to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Out-of-the-box performance. Each model was scored as submitted, with no adaptation to this paediatric cohort. This establishes what the system actually delivers when applied to a new patient population — typically the number that diverges most sharply from what contributors claim in their proposals.

Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into the workspace and ran training on the 1,200-sample patient cohort. This fine-tuned the model weights to the specific marker distributions, protein expression patterns, and clinical variable ranges of the paediatric rare disease population — adapting from a generalised classifier to one calibrated for this cohort. After training, the adapted model was evaluated automatically against the held-out evaluation set. The patient data never left the hospital's infrastructure. Contributors received only their own results; no contributor had visibility into another's training runs or scores before the leaderboard published.

Each contributor received:

Training access: 1,200 liquid biopsy samples (153 features, paediatric rare disease cohort) for model fine-tuning inside the workspace
Evaluation environment: Sandboxed execution — adapted models run against the evaluation cohort, no patient data export path available
Metrics tracked: MSE on the classification target, performance breakdown by feature block (marker-driven vs. broad protein-driven vs. clinically-augmented), and feature attribution outputs for companion diagnostic development
Key modelling challenge: Handling the right-skewed marker distributions — contributors whose models assume normally distributed inputs typically show the largest out-of-the-box degradation and the largest fine-tuning recovery
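
The page reports MSE on a classification target without giving the formula. A minimal sketch of one plausible reading, assuming the target is encoded as 0/1 and each model outputs a probability of disease state (both are assumptions; under them the metric is the same quantity as the Brier score). The labels and predictions are made up for illustration.

```python
def mean_squared_error(y_true, y_prob):
    """Mean of squared differences between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical labels and predictions, for illustration only
labels = [0, 0, 0, 1, 1]
confident = [0.1, 0.2, 0.1, 0.9, 0.8]   # a model that separates the classes
hedging = [0.5, 0.5, 0.5, 0.5, 0.5]     # a model that always predicts 0.5

mse_confident = mean_squared_error(labels, confident)
mse_hedging = mean_squared_error(labels, hedging)
print(round(mse_confident, 3), round(mse_hedging, 3))
```

Under this reading, a model that always predicts 0.5 scores 0.25 regardless of the labels, which gives one reference point for interpreting leaderboard values.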

Results

→ View the full model leaderboard — complete contributor rankings, MSE breakdown, and marker feature attribution across all submissions.

Contributor	Claimed MSE	Out-of-the-Box	After Fine-tuning	Marker-Only MSE
Contributor A	0.11	0.19	0.14	0.22
Contributor B ✅	0.13	0.15	0.11	0.13
Contributor C ⚠️	0.10	0.24	0.18	0.31
Contributor D	0.14	0.21	0.16	0.25

What the numbers reveal:

Contributor B is the only model to improve beyond its own claimed MSE after fine-tuning on this paediatric cohort — moving from 0.13 claimed to 0.11 post-fine-tuning. More tellingly, its marker-only MSE of 0.13 is the strongest in the evaluation: this model has learned to extract diagnostic signal from the skewed, sparse circulating marker features that define liquid biopsy performance. Its architecture handles right-tailed distributions natively rather than transforming them away.

Contributor C entered with the strongest claimed MSE at 0.10, the most optimistic proposal in the evaluation. On this paediatric cohort, it delivered 0.24 out-of-the-box and recovered only to 0.18 after fine-tuning. Its marker-only MSE of 0.31 suggests that the model is largely ignoring the disease markers and relying on the broader protein panel, which carries less disease-specific signal in this population. A claimed MSE advantage of 0.03 over Contributor B (0.10 versus 0.13) becomes a 0.07 deficit after fine-tuning on this cohort (0.18 versus 0.11).

Contributor A shows moderate generalisation and a meaningful recovery through fine-tuning — from 0.19 to 0.14 — but its marker-only performance reveals the same structural weakness: the model treats the specialised markers as noise rather than signal.
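
The comparisons above can be reproduced directly from the results table. The sketch below copies the four rows as given and derives three quantities. The names `claim_gap`, `recovery`, and `beats_claim` are labels introduced here for illustration, not leaderboard fields.

```python
# Rows copied from the results table:
# (claimed, out_of_the_box, after_fine_tuning, marker_only)
results = {
    "A": (0.11, 0.19, 0.14, 0.22),
    "B": (0.13, 0.15, 0.11, 0.13),
    "C": (0.10, 0.24, 0.18, 0.31),
    "D": (0.14, 0.21, 0.16, 0.25),
}

summary = {}
for name, (claimed, oob, tuned, marker_only) in results.items():
    summary[name] = {
        "claim_gap": round(oob - claimed, 2),  # how far out-of-the-box misses the claim
        "recovery": round(oob - tuned, 2),     # MSE reduction from fine-tuning
        "beats_claim": tuned < claimed,        # lower MSE is better
    }

# Contributor with the lowest marker-only MSE
best_marker = min(results, key=lambda n: results[n][3])
# Contributors whose fine-tuned MSE is below their own claim
beat_claim = [n for n, s in summary.items() if s["beats_claim"]]
# Post-fine-tuning difference between C and B
gap_c_vs_b = round(results["C"][2] - results["B"][2], 2)

for name, s in summary.items():
    print(name, s)
print("best marker-only:", best_marker, "| beats claim:", beat_claim, "| C minus B:", gap_c_vs_b)
```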

Business Impact

Illustrative assumptions: the institution receives €120,000 per contributor access period (6-month training window); 4 contributors in the initial cohort, each taken to buy one access period per year (4 × €120,000 = €480,000); the paediatric rare disease diagnostic journey currently averages 5 years; a validated blood-based classifier could reduce the pre-biopsy diagnostic window by 18 months for patients in screened populations.

Scenario	Annual Revenue	Patient Impact	Companion Diagnostic Readiness
No external access	€0	—	—
Raw data transfer	One-time fee, asset relinquished	Delayed (legal timeline)	External — institution loses attribution
tracebloc workspace access ✅	€480,000 / year recurring	18-month earlier diagnosis window	Institution retains data custody and co-authorship

The recurring revenue model is the key differentiator. A one-time data transfer agreement generates a single payment and surrenders the asset. A tracebloc workspace generates access fees per contributor per period, scales with demand, and leaves the institution in full control of the data. As the PEDIOLIX cohort grows and new therapeutic areas are added, the workspace becomes more valuable — not less — because the training dataset improves without the institution losing custody.
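
The recurring figure in the table follows from simple arithmetic on the illustrative assumptions. The sketch below assumes each contributor buys one 6-month access period per year, which is the reading that reproduces the €480,000 in the table. The parameter `periods_per_year` is introduced here so that other renewal patterns can be tried; it is not a tracebloc setting.

```python
def annual_access_revenue(fee_per_period_eur, contributors, periods_per_year):
    """Annual revenue if every contributor buys `periods_per_year` access periods."""
    return fee_per_period_eur * contributors * periods_per_year

FEE_EUR = 120_000   # per contributor access period (from the illustrative assumptions)
CONTRIBUTORS = 4    # initial cohort size (from the illustrative assumptions)

# One access period per contributor per year: matches the table
one_period = annual_access_revenue(FEE_EUR, CONTRIBUTORS, periods_per_year=1)
# Hypothetical back-to-back renewal of the 6-month window
two_periods = annual_access_revenue(FEE_EUR, CONTRIBUTORS, periods_per_year=2)

print(f"{one_period:,} EUR / {two_periods:,} EUR")
```

Because revenue is linear in both the contributor count and the renewal rate, each additional contributor adds one access fee per period purchased, with no new data transfer involved.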

Decision

Prof. Hartmann's institution selects Contributor B's architecture as the recommended approach for companion diagnostic development, based on marker-only MSE performance and the model's demonstrated ability to handle paediatric rare disease proteomic distributions. A co-development agreement is initiated: the institution provides ongoing cohort access through the tracebloc workspace; the contributor develops the companion diagnostic towards clinical-grade performance and includes the institution as a named data partner in any regulatory submission.

The tracebloc workspace stays active after the initial evaluation. As new contributors enter the market and the PEDIOLIX cohort adds new therapeutic areas and additional samples, each new contributor enters the same workspace under the same terms. The leaderboard becomes a live record of how liquid biopsy classification performance is progressing across contributors and architectures — turning a one-time research asset into an ongoing data collaboration infrastructure.

Explore this use case further:

View the model leaderboard — full contributor rankings, marker vs. protein performance breakdown
Explore the dataset — circulating marker distributions, protein panel statistics, skewness analysis
Start training — submit your own liquid biopsy classification model to this cohort

Related use cases: See how the same secure access model applies to pharmacodynamic proteomics validation in paediatric IBD and safety metabolomics validation in DILI. For a broader view of federated learning applications across pharma and healthcare, see our federated learning applications guide.

Deploy your workspace or schedule a call.

Disclaimer

Disclaimer: The dataset used in this use case is augmented — designed to reflect the statistical structure of real-world paediatric rare disease liquid biopsy proteomics data, including circulating marker distributions, protein expression ranges, and skew profiles characteristic of disease-state biomarker behaviour, without containing any identifiable patient records, diagnoses, or hospital data. The persona, contributor labels, claimed performance figures, revenue assumptions, and commercial scenario are illustrative and based on patterns observed across paediatric rare disease research environments. They do not represent any specific institution, company, dataset, or contractual arrangement.