
Liquid Biopsy Proteomics for Non-Invasive Paediatric Rare Disease
Participants
36
End Date
30.11.26
Dataset
d4onaj8u
Resources
2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute
0 / 100.00 PF
Submits
0/5

About this use case: A paediatric hospital holds 1,200 rare disease liquid biopsy samples that took eight years to assemble — a dataset pharma and diagnostics companies would pay to train on, except that GDPR and re-identification risk make a raw data transfer impossible. tracebloc lets companies train classification models inside the hospital's infrastructure and generates recurring access-fee revenue without a single patient record ever leaving. Explore the data, submit your own model, and see how your approach compares.
The Universitäts-Kinderklinik Hartenberg holds the most comprehensive liquid biopsy dataset for paediatric rare disease in Europe: 1,200 blood-based proteomic samples spanning circulating tumour markers, broad protein panels, and clinical variables from children across multiple rare disease groups. Pharmaceutical companies developing non-invasive diagnostics need to train classification models on this data. Ethics approval, GDPR compliance, and re-identification risk mean they cannot access a single patient record directly.
Prof. Dr. Elisabeth Hartmann, Director of Clinical Research Data, deploys a tracebloc workspace loaded with 1,200 paediatric liquid biopsy samples. Pharmaceutical and diagnostics companies submit their classification models to the workspace. Inside tracebloc's containerised training environment, models train on the patient sample set — fine-tuning weights to the skewed marker distributions and protein expression patterns specific to paediatric rare disease — without the data ever leaving the hospital's infrastructure. tracebloc orchestrates training, scores each adapted model against the evaluation cohort, and publishes results to a live leaderboard automatically. This is a federated learning application of data monetisation: the institution controls the asset, the contributors train on it, and not one patient record moves. Each contributor pays a training access fee. The institution generates recurring research revenue from data it already owns.
In this example evaluation, contributors range from diagnostics firms validating existing blood-based assay models against a paediatric rare disease cohort to pharma companies building novel cell-free DNA analysis classifiers from scratch. The workspace surfaces which approaches handle the skewed marker distributions effectively and which collapse when rare-disease-elevated markers dominate predictions. The tracebloc workspace stays in place as the institution's liquid biopsy cohort grows, enabling continuous model improvement without renegotiating access.
Prof. Hartmann's institution has been running the PEDIOLIX cohort study for eight years. The liquid biopsy layer of the cohort — blood-based proteomic samples from children across 13 therapeutic areas — is the result of a data collection programme that no pharma company could replicate independently. It required ethics approval, patient consent infrastructure for a paediatric population, multi-year longitudinal sampling, and a clinical team experienced in paediatric rare disease phenotyping. The data exists because a hospital built it. It is not a dataset any company can buy.
The diagnostic gap it addresses is substantial. For most paediatric rare diseases, the path to diagnosis still runs through invasive procedures: tissue biopsies, bone marrow aspirates, lumbar punctures. These are painful, carry procedural risk, and are difficult to justify in young children without strong clinical suspicion. The average rare disease diagnosis takes five years from first symptoms. In that window, children are treated empirically, disease progresses, and families carry a burden of uncertainty that compounds the clinical problem.
Blood-based biomarkers, proteins released from tissue and circulating in plasma, could transform this pathway. If proteomics from a standard blood draw can classify disease state accurately enough, invasive procedures can be reserved for confirmation rather than used as first-line diagnostics. Several pharma and diagnostics companies are actively developing liquid biopsy classifiers for paediatric rare disease. Every one of them needs access to an external paediatric cohort to validate or train their model. None of them can replicate the PEDIOLIX sample base.
The commercial interest is real and growing. Companies are willing to pay for training access to this cohort. The institution's problem is not demand — it is compliance. A data transfer agreement covering 1,200 paediatric patients with rare disease diagnoses, proteomic profiles, and clinical variables would require months of legal review, GDPR impact assessments, and data protection board sign-off. And at the end of that process, the institution would have given away its most valuable asset. The dataset — once transferred — is no longer under the institution's control. It can be copied, redistributed, and used beyond any agreement's scope.
The institution's research ethics board and data protection officer have been clear: no raw transfer. But the institution also cannot afford to leave recurring commercial revenue on the table. Research budgets are under pressure. Clinical data assets that took eight years to build deserve to generate ongoing value.
The evaluation dataset contains 1,200 paediatric liquid biopsy samples from children across multiple rare disease groups. Full dataset statistics, feature distributions, and marker behaviour are available in the Exploratory Data Analysis tab.
This dataset is augmented. It was constructed to reflect the statistical structure of a real-world paediatric rare disease liquid biopsy cohort — the circulating marker distributions, the broad protein expression ranges, and the skew profiles characteristic of disease-state versus baseline biomarker behaviour — without containing any identifiable patient records, diagnoses, or hospital data.
| Property | Value |
|---|---|
| Total samples | 1,200 |
| Features | 153 |
| Circulating proteins | ~100 (continuous, bulk blood protein levels) |
| Specialised disease markers | ~20 (continuous, highly skewed — baseline in most, elevated in disease-state subset) |
| Clinical variables | ~30 (continuous, clinical phenotype measurements) |
| Categorical features | 3 (including patient identifier and classification target) |
| Missing values | None |
| Zero inflation | None across all examined features |
| Evaluation metric | MSE |
A note on the marker features: The specialised disease markers show highly skewed distributions with heavy right tails. Most patients show near-baseline values; a subset shows strongly elevated readings. This is not a data quality issue — it is the diagnostic signal. Disease-state elevation in these markers is what makes liquid biopsy diagnostically useful. Models that cannot handle right-skewed, sparse biomarker distributions will underperform relative to those that do.
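One way to check which features fall into this right-skewed category is to compute the sample skewness of each feature. The sketch below is a minimal illustration in plain Python. The synthetic `marker` values, the roughly 90/10 split between baseline and elevated samples, and the threshold of 1.0 are assumptions chosen for demonstration; none of them come from the workspace dataset.

```python
import random


def skewness(values):
    """Fisher-Pearson skewness: third central moment over the cubed std."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def is_right_skewed(values, threshold=1.0):
    """Flag a feature whose skewness exceeds a (hypothetical) threshold."""
    return skewness(values) > threshold


# Synthetic stand-in for one specialised marker (assumed, not real data):
# roughly 90% of samples near baseline, roughly 10% strongly elevated.
rng = random.Random(0)
marker = [
    rng.gauss(1.0, 0.1) if rng.random() < 0.9 else rng.gauss(8.0, 2.0)
    for _ in range(1200)
]

# A symmetric feature for contrast.
symmetric = [float(i) for i in range(1, 6)]

print(f"marker skewness: {skewness(marker):.2f}, flagged: {is_right_skewed(marker)}")
print(f"symmetric skewness: {skewness(symmetric):.2f}, flagged: {is_right_skewed(symmetric)}")
```

A flag like this only identifies the features that need care. Whether to transform them or to use a model that tolerates heavy right tails remains a modelling decision.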
Each contributor submitted their non-invasive classification model to the tracebloc workspace. The evaluation ran in two phases.
Phase 1 — Out-of-the-box performance. Each model was scored as submitted, with no adaptation to this paediatric cohort. This establishes what the system actually delivers when applied to a new patient population — typically the number that diverges most sharply from what contributors claim in their proposals.
Phase 2 — Fine-tuning. Contributors were given access to the training environment inside the tracebloc workspace. Each contributor transferred their model into the workspace and ran training on the 1,200-sample patient cohort. This fine-tuned the model weights to the specific marker distributions, protein expression patterns, and clinical variable ranges of the paediatric rare disease population — adapting from a generalised classifier to one calibrated for this cohort. After training, the adapted model was evaluated automatically against the held-out evaluation set. The patient data never left the hospital's infrastructure. Contributors received only their own results; no contributor had visibility into another's training runs or scores before the leaderboard published.
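The two phases can be sketched with a toy model: score the submitted weights on a held-out split, fine-tune on the training split, then score again on the same held-out split. This is a minimal sketch under stated assumptions, not the tracebloc training pipeline. The one-feature linear model, the noise-free toy data, the learning rate, and the function names `predict` and `fine_tune` are all hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(weights, xs):
    """One-feature linear model: y = w * x + b."""
    w, b = weights
    return [w * x + b for x in xs]


def fine_tune(weights, xs, ys, lr=0.1, steps=200):
    """Plain gradient descent on MSE, starting from the submitted weights."""
    w, b = weights
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


# Toy cohort (assumed): the local population follows y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
train_x, train_y = xs[0::2], ys[0::2]   # training split
eval_x, eval_y = xs[1::2], ys[1::2]     # held-out evaluation split

# Phase 1: score the model exactly as submitted.
submitted = (1.0, 0.0)  # weights learned on some other population
mse_out_of_box = mse(eval_y, predict(submitted, eval_x))

# Phase 2: adapt on the training split, then re-score on the held-out split.
adapted = fine_tune(submitted, train_x, train_y)
mse_fine_tuned = mse(eval_y, predict(adapted, eval_x))

print(f"out-of-the-box MSE: {mse_out_of_box:.3f}")
print(f"fine-tuned MSE:     {mse_fine_tuned:.6f}")
```

Scoring both phases on the same held-out split is what makes the two numbers comparable.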
→ View the full model leaderboard — complete contributor rankings, MSE breakdown, and marker feature attribution across all submissions.
| Contributor | Claimed MSE | Out-of-the-Box MSE | Fine-tuned MSE | Marker-Only MSE |
|---|---|---|---|---|
| Contributor A | 0.11 | 0.19 | 0.14 | 0.22 |
| Contributor B ✅ | 0.13 | 0.15 | 0.11 | 0.13 |
| Contributor C ⚠️ | 0.10 | 0.24 | 0.18 | 0.31 |
| Contributor D | 0.14 | 0.21 | 0.16 | 0.25 |
What the numbers reveal:
Contributor B is the only model to improve beyond its own claimed MSE after fine-tuning on this paediatric cohort — moving from 0.13 claimed to 0.11 post-fine-tuning. More tellingly, its marker-only MSE of 0.13 is the strongest in the evaluation: this model has learned to extract diagnostic signal from the skewed, sparse circulating marker features that define liquid biopsy performance. Its architecture handles right-tailed distributions natively rather than transforming them away.
Contributor C entered with the strongest claimed MSE at 0.10, the most optimistic proposal in the evaluation. On this paediatric cohort, it delivered 0.24 out-of-the-box and recovered only to 0.18 after fine-tuning. Its marker-only MSE of 0.31, the weakest in the evaluation, suggests that the model is largely ignoring the disease markers and relying on the broader protein panel, which carries less disease-specific signal in this population. A claimed MSE advantage of 0.03 over Contributor B becomes a 0.07 deficit after fine-tuning on this cohort.
Contributor A shows moderate generalisation and a meaningful recovery through fine-tuning, from 0.19 to 0.14, but its marker-only MSE of 0.22 points to the same structural weakness: the model treats the specialised markers as noise rather than signal.
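The comparisons above can be reproduced from the results table. The short sketch below copies the table values and derives the claim gap, the fine-tuning gain, and the contributors that beat their own claim; the variable names are illustrative only.

```python
# Values copied from the results table, as
# (claimed, out_of_the_box, fine_tuned, marker_only) MSE. Lower is better.
results = {
    "A": (0.11, 0.19, 0.14, 0.22),
    "B": (0.13, 0.15, 0.11, 0.13),
    "C": (0.10, 0.24, 0.18, 0.31),
    "D": (0.14, 0.21, 0.16, 0.25),
}

for name, (claimed, out_of_box, fine_tuned, marker_only) in results.items():
    claim_gap = out_of_box - claimed          # how far the claim overshoots
    tuning_gain = out_of_box - fine_tuned     # improvement from fine-tuning
    print(f"Contributor {name}: claim gap {claim_gap:+.2f}, "
          f"fine-tuning gain {tuning_gain:.2f}, marker-only {marker_only:.2f}")

# Contributors whose fine-tuned MSE is better than their own claim.
beat_claim = [n for n, (claimed, _, tuned, _) in results.items() if tuned < claimed]

# Contributor with the lowest marker-only MSE.
best_marker = min(results, key=lambda n: results[n][3])

# C's claimed advantage over B, and its deficit after fine-tuning.
claimed_advantage = round(results["B"][0] - results["C"][0], 2)
tuned_deficit = round(results["C"][2] - results["B"][2], 2)

print(beat_claim, best_marker, claimed_advantage, tuned_deficit)
```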
Illustrative assumptions: the institution receives €120,000 per contributor access period (6-month training window); 4 contributors in the initial cohort; the paediatric rare disease diagnostic journey currently averages 5 years; a validated blood-based classifier could reduce the pre-biopsy diagnostic window by 18 months for patients in screened populations.
| Scenario | Annual Revenue | Patient Impact | Companion Diagnostic Readiness |
|---|---|---|---|
| No external access | €0 | — | — |
| Raw data transfer | One-time fee, asset relinquished | Delayed (legal timeline) | External — institution loses attribution |
| tracebloc workspace access ✅ | €480,000 / year recurring | Pre-biopsy diagnostic window potentially 18 months shorter | Institution retains data custody and co-authorship |
The recurring revenue model is the key differentiator. A one-time data transfer agreement generates a single payment and surrenders the asset. A tracebloc workspace generates access fees per contributor per period, scales with demand, and leaves the institution in full control of the data. As the PEDIOLIX cohort grows and new therapeutic areas are added, the workspace becomes more valuable — not less — because the training dataset improves without the institution losing custody.
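The arithmetic behind the revenue figure can be checked directly. The €480,000 in the table matches four contributors each paying for one 6-month access period per year. The source does not state whether contributors would renew for a second period within the same year, so `periods_per_contributor_per_year` below is a hypothetical parameter, not part of the illustrative assumptions.

```python
FEE_PER_PERIOD_EUR = 120_000  # per contributor access period (from the assumptions)
CONTRIBUTORS = 4              # initial cohort size (from the assumptions)


def annual_revenue(periods_per_contributor_per_year):
    """Access-fee revenue per year for a given renewal rate (hypothetical parameter)."""
    return FEE_PER_PERIOD_EUR * CONTRIBUTORS * periods_per_contributor_per_year


one_period = annual_revenue(1)   # one 6-month window per contributor per year
two_periods = annual_revenue(2)  # back-to-back windows, if contributors renew

print(f"one period per year:  EUR {one_period:,}")
print(f"two periods per year: EUR {two_periods:,}")
```

Under the one-period reading the result equals the table's figure; continuous back-to-back access would double it.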
Prof. Hartmann's institution selects Contributor B's architecture as the recommended approach for companion diagnostic development, based on marker-only MSE performance and the model's demonstrated ability to handle paediatric rare disease proteomic distributions. A co-development agreement is initiated: the institution provides ongoing cohort access through the tracebloc workspace; the contributor develops the companion diagnostic towards clinical-grade performance and includes the institution as a named data partner in any regulatory submission.
The tracebloc workspace stays active after the initial evaluation. As new contributors enter the market and the PEDIOLIX cohort adds new therapeutic areas and additional samples, each new contributor enters the same workspace under the same terms. The leaderboard becomes a live record of how liquid biopsy classification performance is progressing across contributors and architectures — turning a one-time research asset into an ongoing data collaboration infrastructure.
Related use cases: See how the same secure access model applies to pharmacodynamic proteomics validation in paediatric IBD and safety metabolomics validation in DILI. For a broader view of federated learning applications across pharma and healthcare, see our federated learning applications guide.
Disclaimer: The dataset used in this use case is augmented — designed to reflect the statistical structure of real-world paediatric rare disease liquid biopsy proteomics data, including circulating marker distributions, protein expression ranges, and skew profiles characteristic of disease-state biomarker behaviour, without containing any identifiable patient records, diagnoses, or hospital data. The persona, contributor labels, claimed performance figures, revenue assumptions, and commercial scenario are illustrative and based on patterns observed across paediatric rare disease research environments. They do not represent any specific institution, company, dataset, or contractual arrangement.