
IARC 60th Anniversary - 19-21 May 2026

Session : 19/05/26 - Posters

Leveraging population-scale molecular trait data with deep transfer learning for cancer prediction

SHAKEEL A. ¹,², MERRIEL S. ³,⁴, SMITH J. ⁵, MCGOUGH . ⁶, ABDALLAH Z. ⁷, SUDERMAN M. ¹,², YOUSEFI P. ¹,²

¹ MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, United Kingdom; ² Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom; ³ Division of Population Health, Health Services Research Primary Care (L5), Manchester, United Kingdom; ⁴ College of Medicine and Health, University of Exeter, Exeter, United Kingdom; ⁵ Royal Devon University Healthcare NHS Foundation Trust, Exeter, United Kingdom; ⁶ School of Computing, Newcastle University, Newcastle, United Kingdom; ⁷ School of Engineering Mathematics and Technology, University of Bristol, Bristol, United Kingdom

Background: Identifying robust biomarkers for cancer detection and progression remains challenging, particularly when working with limited or heterogeneous datasets.

Objectives: Here, we present a proof-of-concept deep transfer learning framework for cancer prediction using high-dimensional blood-based molecular trait profiles.

Methods: The training data comprised plasma proteomes from 13,208 pan-cancer cases and 39,806 controls in UK Biobank. To address class imbalance and enrich the feature space, we trained a variational autoencoder on the pan-cancer cases and used it to augment the training data with synthetic pan-cancer samples. We then trained several machine learning models on both the original and the augmented datasets to distinguish cases from controls. Models included a Convolutional Neural Network (CNN) and more traditional machine learning models, including XGBoost, Support Vector Machines, and Elastic Net regression. Performance was evaluated in an independent saliva-based dataset from a head and neck cancer case-control study (n = 156).
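The augmentation step can be illustrated with a minimal sketch. It is not the authors' pipeline: a per-feature Gaussian fitted to the cases stands in for the trained variational autoencoder, the function names are hypothetical, and adding synthetic cases until the two classes are equal in size is an assumption, since the abstract does not state the target ratio.

```python
import random
import statistics


def fit_diagonal_gaussian(rows):
    """Estimate a per-feature mean and standard deviation from the case rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def sample_synthetic(means, stds, n, rng):
    """Draw n synthetic rows (stand-in for sampling from a trained VAE decoder)."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n)]


def augment_to_balance(cases, controls, seed=0):
    """Append synthetic cases until both classes have the same number of rows.

    Balancing to a 1:1 ratio is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_needed = max(0, len(controls) - len(cases))
    means, stds = fit_diagonal_gaussian(cases)
    return cases + sample_synthetic(means, stds, n_needed, rng)


# Toy data: 3 cases and 9 controls with 4 made-up features each.
toy_rng = random.Random(1)
toy_cases = [[toy_rng.gauss(1.0, 1.0) for _ in range(4)] for _ in range(3)]
toy_controls = [[toy_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(9)]
augmented_cases = augment_to_balance(toy_cases, toy_controls)
```

Under the 1:1 assumption, the cohort sizes reported above would call for 39,806 - 13,208 = 26,598 synthetic cases. The original cases are kept unchanged at the front of the augmented list, so real and synthetic rows remain distinguishable.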

Results: The CNN trained on augmented data (AUC = 0.88) outperformed all other models and training scenarios (AUC < 0.77). SHapley Additive exPlanations (SHAP) for this model identified well-known cancer markers as key predictive features, including IL6, CXCL17, CXCL13, IGF1R and FASLG.
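For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it directly from that definition; the function name, labels, and scores are illustrative only and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Labels are 1 for cases and 0 for controls. A tie between a case and a
    control counts as half a win.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Made-up example: 3 cases and 4 controls.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 11 of 12 case-control pairs are ranked correctly
```

The pairwise loop is quadratic in sample size, which is acceptable for an evaluation set of a few hundred samples; a rank-based formulation would be preferable for much larger sets.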

Conclusions/implications: These results highlight the potential of deep transfer learning for leveraging high-dimensional molecular data from large general populations to improve predictive performance, even for cancers with limited numbers of cases available for training. Further, synthetic data augmentation by deep generative models shows promise for improving model performance in class-imbalanced settings. At the conference, we will additionally present preliminary results from a similar investigation of DNA methylation for predicting prostate cancer outcomes.