
Pre-Deployment Validation: Paediatric IBD Anti-TNF Response Stratification

Participants: 9
End Date: 25.02.27
Dataset: d1133xgc
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 100.00 PF
Submits: 0 / 5


Overview

About this use case: Four candidate pharmacodynamic biomarker panels all look competitive on internal trial data, yet none of them has been tested on a patient population it wasn't developed on. tracebloc runs the bake-off on an independent paediatric IBD cohort, scoring every panel under identical conditions, without a single longitudinal sample leaving the hospital. Explore the data, submit your own model, and see how your approach compares.

Problem

Four candidate pharmacodynamic proteomics models have been developed internally to predict anti-TNF treatment response in paediatric IBD patients. Before any panel can advance to clinical use, each must be validated on an independent cohort: the same holdout set, the same evaluation pipeline, no exceptions. Without this step, performance claims from internal trials are unverifiable.

Solution

Dr. Clara Mendes, Head of Translational Research at a mid-size clinical-stage biotech, deploys a tracebloc workspace loaded with 800 longitudinal proteomics samples from an independent paediatric IBD cohort: approximately 160 unique patients, each observed at up to five timepoints during biologic induction. Each internal team submits their candidate model to the workspace. Inside tracebloc's containerised training environment, models are fine-tuned on the patient sample set, adapting pharmacokinetic, proteomic, and clinical feature weights to this independent cohort, without any data leaving the clinical institution's infrastructure. tracebloc orchestrates execution, scores each adapted model against the held-out evaluation set, and publishes results to a live leaderboard automatically. This is pre-deployment validation run as a federated learning application: the proteomics data stays on the hospital's infrastructure from start to finish.

Outcome

In this example evaluation, one candidate model maintained consistent performance across both internal and external cohorts — the only panel to hold its MSE within acceptable bounds at the 72-hour and Day 7 timepoints where clinical intervention is still feasible. Two models showed statistically significant degradation on this independent cohort, indicating overfitting to the original trial population. The tracebloc workspace stays in place for re-evaluation as biomarker panels are updated and new development candidates emerge.

The Operational Challenge

Dr. Mendes's team is preparing a regulatory package for a next-generation anti-TNF biologic in paediatric inflammatory bowel disease. Internal Phase II data has yielded four candidate pharmacodynamic biomarker panels, each identifying a different combination of proteomic features measured at early timepoints after induction to predict clinical response at week 12. The panels use different feature subsets, different timepoint windows, and different model architectures. All four look competitive on internal validation. None has been tested on data it has never seen.

Anti-TNF biologics are the standard of care for moderate-to-severe paediatric IBD, but response rates are inconsistent. Roughly 30–40% of paediatric patients lose response within the first year. If a validated PD biomarker panel can identify non-responders at 72 hours or Day 7 post-infusion, before the second dose, clinicians can switch to alternative biologics before the disease progresses further and irreversible mucosal damage accumulates. That early signal has direct implications for trial design: response-enriched enrolment, cleaner endpoints, and a shorter path to IND filing.

The problem is that none of the four candidate panels can be declared superior based on internal data alone. Each was trained on the same company's Phase II cohort — the same patient mix of Crohn's and ulcerative colitis, the same dosing regimen, the same collection protocol. Internal validation metrics are unreliable when the training distribution and the validation distribution are identical. A panel that predicts well internally might be picking up drug-specific pharmacodynamic effects rather than disease-specific biology. Or it might be overfitting to the Crohn's-heavy composition of the original trial, collapsing when tested on a cohort with a different Crohn's-to-UC ratio.

Regulatory agencies are increasingly explicit about this requirement. External validation on an independent cohort is not optional documentation — it is the evidence that distinguishes a generalised biomarker from a trial-specific artefact. Without it, the regulatory submission carries a credibility gap that reviewers will find.

The operational barrier is access. Longitudinal proteomics data from paediatric IBD patients — with dense temporal sampling during biologic induction, drug exposure parameters, and treatment response labels — is among the most tightly governed clinical data in the field. Ethics approvals take months. GDPR compliance for any cross-border transfer requires legal review. And even within a single jurisdiction, sharing the raw patient-level dataset with an external institution raises re-identification concerns that data protection officers will not approve quickly.

Dr. Mendes needs a way to run the bake-off on a real independent cohort, under controlled evaluation conditions, with results that are auditable for inclusion in regulatory documentation — without any patient record leaving the hospital.

Stakeholders

  • Dr. Clara Mendes, Head of Translational Research: Owns the PD biomarker strategy. KPIs: external validation MSE, feature consistency across cohorts, regulatory submission timeline. Needs documented evidence of generalisation before the IND package is filed.
  • Chief Medical Officer: Responsible for clinical development strategy. Needs confidence that the biomarker panel will hold up in the pivotal trial before committing resources. A failed biomarker at Phase III is a programme-ending event.
  • Head of Bioinformatics: Manages the four candidate model development teams. Needs a neutral evaluation environment where each team's model is scored on the same holdout set under identical conditions — internal competition without internal politics.
  • Data Protection Officer: Owns GDPR compliance for any data transfer to external parties. Patient-level longitudinal proteomics with clinical response labels creates re-identification risk. No transfer without full legal clearance.
  • Regulatory Affairs Lead: Responsible for IND filing and clinical hold risk. External validation data needs to be auditable — every submission logged, metrics reproducible, methodology documented for agency review.

The Underlying Dataset

The evaluation dataset contains 800 longitudinal proteomics samples from approximately 160 paediatric IBD patients observed across up to five timepoints during the first two weeks of biologic induction therapy. Full dataset statistics, feature distributions, and temporal structure are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to reflect the statistical structure of a real longitudinal anti-TNF induction proteomics cohort — the timepoint distribution, the pharmacokinetic feature range, the protein expression variance — without containing any identifiable patient records, hospital data, or treatment outcome information.

Property | Value
Total samples | 800
Approximate unique patients | 160
Timepoints | 5 (Baseline 164, 24h 161, 72h 161, Day 7 164, Day 14 150)
Features | 183
Proteomic measurements | ~150 (anonymised protein expression readouts)
Pharmacokinetic parameters | 10 (drug exposure, clearance, concentration metrics)
Clinical variables | ~20 (clinical phenotype measurements)
Missing values | None
Evaluation metric | MSE (equivalent to Brier score for the binary response label)

A note on the temporal structure: Each patient appears at up to five timepoints mapping to the standard anti-TNF induction monitoring window (pre-infusion Baseline, acute response at 24h and 72h, early maintenance at Day 7, and induction completion at Day 14). Models that use specific timepoint snapshots (e.g. 72h only) can be compared directly against models that use temporal trajectories across multiple timepoints. The near-even distribution across timepoints (18.8%–20.5% per timepoint) means no single timepoint dominates the evaluation.
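
To make the snapshot-versus-trajectory comparison concrete, here is a minimal sketch of how the same longitudinal table can be viewed either way. The column names (patient_id, timepoint, prot_001, pk_auc, response) and the toy values are illustrative assumptions, not the workspace's actual schema.

```python
import pandas as pd

# Illustrative only: column names and values are assumptions, not the
# workspace's real schema.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "timepoint":  ["Baseline", "72h", "Day 7", "Baseline", "72h", "Day 7"],
    "prot_001":   [0.41, 0.62, 0.55, 0.38, 0.44, 0.47],   # one anonymised protein readout
    "pk_auc":     [0.0, 118.0, 96.0, 0.0, 131.0, 104.0],  # one pharmacokinetic parameter
    "response":   [1, 1, 1, 0, 0, 0],                     # week-12 clinical response label
})

# Single-timepoint snapshot view: one row per patient at 72h only.
snapshot_72h = df[df["timepoint"] == "72h"].set_index("patient_id")

# Multi-timepoint trajectory view: one wide row per patient, with features
# spread across timepoints (prot_001_Baseline, prot_001_72h, ...).
trajectory = df.pivot_table(index="patient_id", columns="timepoint",
                            values=["prot_001", "pk_auc"])
trajectory.columns = [f"{feat}_{tp}" for feat, tp in trajectory.columns]

print(snapshot_72h)
print(trajectory)
```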

How Evaluation Works

Each candidate model was submitted to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Out-of-the-box performance. Each model was scored as submitted, with no adaptation to the external validation cohort. This establishes the true generalisation baseline: what the system actually delivers when applied to data it has never seen, from a patient population it was not trained on.

Phase 2 — Fine-tuning. Each team was given access to the training environment inside the tracebloc workspace. Models were transferred into the workspace and trained on the external patient sample set. This training process fine-tuned the model weights to the feature distributions, pharmacokinetic ranges, and proteomic expression patterns of this independent cohort — adapting from a model calibrated on one trial population to a system optimised for this independent dataset. After training, the adapted model was evaluated automatically against the held-out sample set. The patient data never left the hospital's infrastructure. Each team received only their own results; no team had visibility into another's scores before the leaderboard published.
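
A minimal sketch of the two-phase protocol, with a generic scikit-learn model standing in for a candidate panel. The data shapes, the patient-grouped split, and the "fine-tune by refitting" shortcut are all assumptions made for illustration; the tracebloc workspace performs orchestration and scoring through its own containerised pipeline, which this sketch does not reproduce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Stand-ins for the workspace data: 800 samples x 183 features, a binary
# week-12 response label, and a patient id per sample (all assumptions).
X = rng.normal(size=(800, 183))
y = rng.integers(0, 2, size=800)
patient_id = np.repeat(np.arange(160), 5)

# Hold out whole patients, so no patient contributes samples to both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, eval_idx = next(splitter.split(X, y, groups=patient_id))

# Phase 1: score the model as submitted. A model pre-fitted on a small slice
# stands in for a panel trained on the internal Phase II cohort.
submitted = LogisticRegression(max_iter=1000).fit(X[train_idx[:100]], y[train_idx[:100]])
p_oob = submitted.predict_proba(X[eval_idx])[:, 1]
print("out-of-the-box MSE:", round(mean_squared_error(y[eval_idx], p_oob), 3))

# Phase 2: fine-tune on the workspace training split (simplified here to a
# refit), then re-score on the same held-out evaluation set.
adapted = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
p_ft = adapted.predict_proba(X[eval_idx])[:, 1]
print("fine-tuned MSE:", round(mean_squared_error(y[eval_idx], p_ft), 3))
```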

Each contributor received:

  • Training access: 800 longitudinal samples (five timepoints, 183 features) for model adaptation inside the workspace
  • Evaluation environment: Sandboxed execution — adapted models run against the evaluation set, no patient data export path available
  • Metrics tracked: MSE on the binary treatment response label (equivalent to Brier score), performance breakdown by timepoint (Baseline, 24h, 72h, Day 7, Day 14), and feature importance outputs for regulatory documentation; a minimal sketch of this metric follows the list
  • Temporal constraint: Separate evaluation tracks for single-timepoint models (e.g. 72h snapshot) and multi-timepoint trajectory models — enabling fair comparison across architecturally different approaches
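
For a binary response label, the MSE of a predicted probability is exactly the Brier score, and the per-timepoint breakdown is a grouped mean of squared errors. A minimal check, with illustrative arrays standing in for real predictions:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, brier_score_loss

# Illustrative stand-ins: binary labels, predicted response probabilities,
# and the timepoint of each evaluation sample.
y_true = np.array([1, 0, 1, 1, 0, 0])
p_pred = np.array([0.83, 0.21, 0.64, 0.91, 0.35, 0.12])
timepoint = np.array(["Baseline", "24h", "72h", "72h", "Day 7", "Day 14"])

# For a binary label, MSE of the predicted probability equals the Brier score.
assert np.isclose(mean_squared_error(y_true, p_pred),
                  brier_score_loss(y_true, p_pred))

# Per-timepoint breakdown, as reported on the leaderboard.
df = pd.DataFrame({"sq_err": (p_pred - y_true) ** 2, "timepoint": timepoint})
print(df.groupby("timepoint")["sq_err"].mean())
```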

Results

→ View the full model leaderboard — complete candidate rankings, MSE by timepoint, and feature consistency analysis across all submissions.

Model | Internal MSE | Out-of-the-Box MSE | Fine-tuned MSE | 72h MSE | Day 7 MSE
Panel A | 0.089 | 0.124 | 0.103 | 0.109 | 0.107
Panel B ✅ | 0.091 | 0.098 | 0.094 | 0.096 | 0.098
Panel C ⚠️ | 0.085 | 0.157 | 0.131 | 0.168 | 0.142
Panel D | 0.093 | 0.141 | 0.118 | 0.129 | 0.122

What the numbers reveal:

Panel B is the only candidate to maintain near-internal-validation performance on the independent cohort — moving from 0.091 MSE internally to 0.094 after fine-tuning on the external patient sample set. Its 72h performance of 0.096 MSE is the critical result: this is the timepoint where a validated prediction can influence the second-infusion decision. Panel B achieves this without requiring temporal trajectory modelling — it operates on the 72h snapshot alone, simplifying clinical deployment and regulatory documentation considerably.

Panel C had the best internal MSE at 0.085. On the external cohort, it collapsed — 0.157 out-of-the-box, recovering only partially to 0.131 after fine-tuning. The degradation pattern suggests overfitting to the specific Crohn's-to-UC composition of the original trial. This model's proteomic feature weights are optimised for a patient mix that does not reflect the distribution in this independent cohort. An internal MSE advantage of 0.006 over Panel B evaporates into a 0.037 gap on external data.

Panel A and Panel D both show moderate generalisation, with post-fine-tuning MSE landing between the extremes. Neither achieves Panel B's 72h performance, and both require longer timepoint windows (Day 7 or Day 14) to reach acceptable MSE — limiting their utility for early non-responder identification before the second infusion.

Business Impact

Illustrative assumptions:

  • Paediatric IBD programme
  • 400 patients enrolled in pivotal trial
  • 30% non-responder rate (120 patients)
  • €35,000 cost per avoidable treatment episode where non-responders continue on ineffective therapy for 12 weeks before switch
  • 8-month delay to IND filing if external validation is rejected by regulatory agency

Scenario | Non-responders Identified Early | Avoidable Treatment Cost | IND Timeline | Programme Risk
No external validation | - | - | +8 months (regulatory query) | High: unverifiable biomarker claim
Panel C (best internal) | 58% at Day 14 | €1,764,000 saved | On schedule | High: generalisation failure
Panel B ✅ | 81% at 72h | €3,402,000 saved | On schedule | Low: validated generalisation

Panel B's 72h performance translates to identifying 81% of non-responders before the second infusion, enabling earlier switch decisions and reducing avoidable treatment exposure. Panel C's Day 14 performance identifies fewer non-responders later — after another six weeks of ineffective therapy — and its external validation failure carries the additional cost of an IND filing delay.
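
A quick check of the illustrative arithmetic for Panel B, using only the assumptions listed above:

```python
# Illustrative arithmetic only, taken from the assumptions above.
enrolled = 400
non_responder_rate = 0.30
cost_per_episode = 35_000            # EUR per avoidable 12-week treatment episode

non_responders = enrolled * non_responder_rate     # 120 patients

# Panel B flags 81% of non-responders at 72h, before the second infusion.
flagged = 0.81 * non_responders                    # 97.2, ~97 patients
savings = flagged * cost_per_episode               # EUR 3,402,000
print(f"Panel B avoidable treatment cost saved: EUR {savings:,.0f}")
```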

The value of the bake-off is not just picking the right panel. It is the audit trail: every submission logged, every MSE computed under identical conditions, every result reproducible. That documentation is what closes the regulatory credibility gap.

Decision

Dr. Mendes's team selects Panel B for inclusion in the IND submission. External validation MSE of 0.094, roughly 3% above the internal figure, provides the generalisation evidence the regulatory package requires. The 72h single-timepoint architecture simplifies clinical implementation: one blood draw after the first infusion, one prediction, one clinical decision.

The tracebloc workspace stays active after the initial bake-off. As the pivotal trial enrols and accumulates new longitudinal proteomic data, the candidate panel can be re-evaluated continuously on the growing sample set inside the same controlled environment. If a new biomarker discovery suggests a feature modification, the updated panel enters the same holdout evaluation without rebuilding any infrastructure. The leaderboard becomes a live record of panel performance over time.

Explore this use case further:

  • View the model leaderboard — full candidate rankings, MSE by timepoint, feature consistency analysis
  • Explore the dataset — longitudinal sample structure, pharmacokinetic distributions, proteomic feature variance
  • Start training — submit your own proteomics model to this evaluation

Related use cases: See how the same secure evaluation approach applies to patient stratification from genomic biomarker panels and liquid biopsy classification in paediatric rare disease. For a broader view of federated learning applications across pharma and healthcare, see our federated learning applications guide.

Deploy your workspace or schedule a call.

Disclaimer

Disclaimer: The dataset used in this use case is augmented — designed to reflect the statistical structure of real-world longitudinal pharmacodynamic proteomics data from paediatric IBD patients, including timepoint distribution, pharmacokinetic feature ranges, and protein expression variance, without containing any identifiable patient records, hospital data, or treatment outcomes. The persona, candidate panel labels, performance figures, business impact assumptions, and regulatory scenario are illustrative and based on patterns observed across paediatric biologic development programmes. They do not represent any specific company, clinical study, or regulatory submission.