
IARC 60th Anniversary - 19-21 May 2026

Session : 19/05/26 - Posters

Federated Inference Under Case-Cohort Sampling: The DRAFT Framework for Multi-Center Biomarker Studies

FARNUDI A. ¹, WANG Y. ², LUO C. ³, CHEN Y. ², VIALLON V. ¹

¹ IARC, Lyon, France; ² Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, United States; ³ Division of Public Health Sciences, Washington University School of Medicine in St. Louis, 600 S Taylor Ave, St. Louis, United States

Background: Cancer remains a major global health challenge despite advances in prevention and treatment. High-throughput multi-omic profiling provides a systems-level view of tumor biology and its links to cancer risk and progression. Molecular epidemiology leverages these data to relate molecular signatures to disease incidence and etiology.

Combining data from several studies is increasingly needed in cancer epidemiology to improve statistical power and generalizability, especially when investigating rare cancer types or conducting detailed subgroup analyses. Ideally, individual-level data would be pooled into a single dataset for joint modeling, a gold standard that often becomes infeasible due to strict privacy regulations such as the GDPR. These restrictions prohibit data sharing across institutions, creating a major barrier to integrated analyses and driving the need for privacy-preserving federated approaches. Meta-analysis, typically using inverse variance weighting, is the most common approach for federated analysis of decentralized data. However, meta-analysis is not designed to replicate pooled results and often yields biased estimates.
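For reference, the inverse variance weighting mentioned above can be sketched as follows. This is a minimal fixed-effect illustration, not the analysis code used in the study; the function name and the per-site numbers are hypothetical.

```python
import math

def inverse_variance_meta(estimates, std_errors):
    """Fixed-effect meta-analysis: weight each site estimate by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical per-site log hazard ratios and standard errors (illustrative only).
site_betas = [0.20, 0.35, 0.10]
site_ses = [0.10, 0.20, 0.10]
pooled_beta, pooled_se = inverse_variance_meta(site_betas, site_ses)
```

Only the site-level estimates and standard errors are exchanged, which is why this approach needs no individual-level data but also why it is not guaranteed to reproduce the pooled fit.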

Federated algorithms have been developed to reproduce pooled analyses across a variety of applications, including survival analysis of cohort data; however, none accommodate cost-efficient sampling schemes such as the case-cohort design. This omission is critical, as full proteomic profiling in large cohorts like EPIC-Europe is cost-prohibitive, and case-cohort sampling offers a practical solution for multi-endpoint analyses.

Objectives: Develop a statistical method for federated analysis of survival data within cost-efficient designs, such as case-cohort studies. In particular, the method should recreate pooled analysis results, while minimizing communication overhead between data centers.

Methods: Building on the PDA (Privacy-Preserving Distributed Algorithms) R package, we developed DRAFT (Design-informed Regression Algorithms for Federated-learning Toolbox), a federated estimation method that generalizes surrogate likelihood-based federated algorithms by embedding Prentice weights and that satisfies the above objectives.
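To illustrate what Prentice weights do in a case-cohort analysis, the sketch below computes the weight of each subject in the risk set at a given event time and the resulting weighted Cox denominator. It is not DRAFT or PDA code: it assumes a single scalar covariate, no delayed entry, and no tied event times, and every name in it is hypothetical.

```python
import math

def prentice_weight(in_subcohort, is_case, follow_up_time, t):
    """Weight of one subject in the risk set at event time t.

    Subcohort members count fully while at risk; a case sampled outside
    the subcohort enters the risk set only at its own failure time.
    """
    if follow_up_time < t:
        return 0.0  # already failed or censored before t
    if in_subcohort:
        return 1.0
    if is_case and follow_up_time == t:
        return 1.0
    return 0.0

def weighted_risk_sum(subjects, beta, t):
    """Denominator of the weighted partial-likelihood term at time t."""
    return sum(
        prentice_weight(s["subcohort"], s["case"], s["time"], t)
        * math.exp(beta * s["x"])
        for s in subjects
    )

# Toy data (illustrative only): two subcohort members and one outside case.
toy_subjects = [
    {"subcohort": True, "case": False, "time": 5.0, "x": 0.0},
    {"subcohort": True, "case": True, "time": 3.0, "x": 1.0},
    {"subcohort": False, "case": True, "time": 4.0, "x": 1.0},
]
denominator_at_3 = weighted_risk_sum(toy_subjects, beta=0.0, t=3.0)
denominator_at_4 = weighted_risk_sum(toy_subjects, beta=0.0, t=4.0)
```

In this toy example the non-subcohort case is absent from the risk set at time 3 and present only at its own failure time 4, which is the defining feature of the weighting.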

We evaluated DRAFT using simulations and real data from EPIC-SomaLogic, an international multi-endpoint case-cohort study nested within EPIC-Europe and comprising proteomics measurements from four countries. The cancer EPIC-SomaLogic sub-study includes measurements of ~7500 proteins profiled using SomaScan in ~10,000 EPIC participants. With the data centralized at IARC, we could derive results from pooled analyses and emulate decentralized settings to compare the pooled results with those obtained from meta-analysis and from DRAFT.

Results: DRAFT’s estimates were nearly identical to the pooled results, whereas meta-analysis showed noticeable discrepancies, for both the simulated and the EPIC data (Fig1, left panels). In terms of statistical significance of the associations with cancer in the EPIC study, DRAFT outperformed meta-analysis by a wide margin and achieved high concordance with pooled analyses, as reflected by the sensitivity and F1-scores computed using the pooled results as the reference (Fig1, right panels).
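The concordance metrics named above can be computed as in the following sketch, which treats the set of proteins declared significant in the pooled analysis as the reference. The function name and protein identifiers are hypothetical, and the significance threshold used in the study is not specified here.

```python
def concordance_metrics(reference_significant, method_significant):
    """Sensitivity, precision, and F1 of a method against a reference set."""
    reference = set(reference_significant)
    found = set(method_significant)
    tp = len(reference & found)   # significant in both
    fp = len(found - reference)   # significant only in the method
    fn = len(reference - found)   # significant only in the reference
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}

# Hypothetical protein identifiers (illustrative only).
pooled_hits = {"P1", "P2", "P3", "P4"}
federated_hits = {"P1", "P2", "P3", "P5"}
metrics = concordance_metrics(pooled_hits, federated_hits)
```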

DRAFT was able to mimic pooled analyses across specific cancer types, including colorectal, prostate, lung, and breast cancer, and consistently outperformed meta-analysis.

Conclusion: DRAFT performs well in both simulated and real datasets, demonstrating strong potential for future applications, particularly for integrating analyses across large cohorts and combining molecular data from different cohort studies.

Fig1