
IARC 60th Anniversary - 19-21 May 2026

Session: Population cohorts, biobanking and research infrastructures

Multivariable Proteomic Associations of Cancer Incidence Identified by Machine Learning in the UK Biobank

PAPAGIANNOPOULOS C. 1, BOURAS E. 1,2, MARKOZANNES G. 1,2, CHALITSIOS C. 1, GUNTER M. 2, TZOULAKI I. 2,3, PAPANDREOU C. 1,4,5, TSILIDIS K. 1,2

1 Department of Hygiene and Epidemiology, School of Medicine, University of Ioannina, Ioannina, Greece; 2 Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, United Kingdom; 3 Biomedical Research Foundation of the Academy of Athens, Athens, Greece; 4 Institut d'Investigació Sanitària Pere Virgili (IISPV), NeuroÈpia Group, Hospital Universitari Sant Joan de Reus, Tarragona, Spain; 5 Department of Nutrition and Dietetics Sciences, School of Health Sciences, Hellenic Mediterranean University (HMU), Siteia, Greece

Background
Large-scale proteomics studies combined with machine learning (ML) can enhance our understanding of carcinogenesis.

Objectives
This prospective cohort study aimed to: (1) identify, using ML, robust multivariable associations between plasma proteins and the risk of six cancers (colorectal, lung, hematologic, melanoma, breast, prostate), and (2) quantify non-linear associations and interactions.

Methods
We included 39,624 UK Biobank participants who were cancer-free until two years after recruitment and had measurements of 2,911 plasma proteins related to cardiometabolic, inflammation, neurology, and oncology functions. To minimize confounding, we derived proteomic residuals by regressing each protein on a total of 18 cancer-specific risk factors (demographics, socio-economic, lifestyle, anthropometric, environmental, family history of any cancer, reproductive health) using linear regression models. We then selected proteins with a multivariable, multistage ML framework combining Cox regularization, stability selection, and ensemble bagging aggregation. Four Bayesian-optimized gradient boosting models (LightGBM) were developed to model cancer incidence using: (i) cancer-specific risk factors; (ii) selected proteins; (iii) selected proteins plus cancer-specific risk factors; and (iv) selected protein residuals. Protein associations with cancer risk observed in both models (iii) and (iv) were characterized as robust protein signals. Shapley additive explanations (SHAP) were used to interpret model outputs and to quantify the non-linear protein-cancer associations and protein-protein interactions of each robust signal.
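The residualization step described above can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, numerically coded risk factors, and complete data; the function name `residualize`, the array shapes, and the simulated inputs are hypothetical and are not taken from the study.

```python
import numpy as np

def residualize(proteins, covariates):
    """Regress each protein column on the covariates (plus an intercept)
    and return the residuals, one column per protein."""
    # Design matrix: intercept column followed by the risk factors.
    X = np.column_stack([np.ones(covariates.shape[0]), covariates])
    # One least-squares fit per protein column, solved jointly.
    beta, *_ = np.linalg.lstsq(X, proteins, rcond=None)
    return proteins - X @ beta

# Simulated stand-ins (hypothetical): 500 participants, 3 risk factors, 4 proteins.
rng = np.random.default_rng(0)
n = 500
covariates = rng.normal(size=(n, 3))
proteins = covariates @ rng.normal(size=(3, 4)) + rng.normal(size=(n, 4))

residuals = residualize(proteins, covariates)
```

By construction, the residuals are uncorrelated with the covariates in the sample, which is why models built on them reflect protein variation not linearly explained by the measured risk factors.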

Results
Over a median follow-up of 14.5 years, 5,332 participants developed cancer. Robust protein signals were identified for each cancer type: colorectal (RBP2, INSL5); lung (ALPP, CEACAM5, CEACAM6); hematologic (QPCT, BCL2, TNFSF13B, FCRL3, FCRL2, PDCD1, CD79B, TNFRSF13B, LY9); melanoma (IL17RB); breast (CRLF1, CENPJ); and prostate (TSPAN1, KLK3). Across the four LightGBM models (i, ii, iii, iv), the area under the curve (AUC) varied by cancer type: predictive performance was highest for prostate (0.63, 0.80, 0.69, 0.77) and lung (0.74, 0.81, 0.82, 0.63) cancers, and moderate for hematologic (0.51, 0.62, 0.59, 0.62), colorectal (0.55, 0.64, 0.64, 0.52), melanoma (0.52, 0.53, 0.57, 0.58), and breast (0.52, 0.57, 0.54, 0.56) cancers. In model (iv), several robust protein signals showed non-linear associations with cancer risk (IL17RB, TSPAN1, RBP2, INSL5, ALPP, BCL2, TNFSF13B, FCRL3), and there were notable protein-protein interactions (lung: ALPP-WFDC; hematologic: CD79B-TNFRSF13B, LY9-PDCD1), highlighting complex relationships underlying cancer development.
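As a reading aid, the reported AUC values can be tabulated as differences from the risk-factor-only model (i). The sketch below uses only the point estimates quoted above; the helper name `gain_over_risk_factors` is hypothetical, and the differences are plain arithmetic, not the statistical comparison underlying the study's conclusions.

```python
# AUC point estimates as reported, in model order (i, ii, iii, iv).
auc = {
    "prostate":    (0.63, 0.80, 0.69, 0.77),
    "lung":        (0.74, 0.81, 0.82, 0.63),
    "hematologic": (0.51, 0.62, 0.59, 0.62),
    "colorectal":  (0.55, 0.64, 0.64, 0.52),
    "melanoma":    (0.52, 0.53, 0.57, 0.58),
    "breast":      (0.52, 0.57, 0.54, 0.56),
}

def gain_over_risk_factors(values, model_index):
    """AUC difference between the chosen model and model (i), rounded to 2 decimals."""
    return round(values[model_index] - values[0], 2)

# For each cancer: differences for models (ii), (iii), (iv) relative to model (i).
gains = {
    cancer: tuple(gain_over_risk_factors(values, k) for k in (1, 2, 3))
    for cancer, values in auc.items()
}

for cancer, (d_ii, d_iii, d_iv) in gains.items():
    print(f"{cancer:<12} ii-i: {d_ii:+.2f}  iii-i: {d_iii:+.2f}  iv-i: {d_iv:+.2f}")
```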

Conclusions/Implications
Proteomics improved cancer risk prediction beyond classical risk factors for prostate, hematologic, and lung cancers. The identified protein signals highlight previously underexplored multivariable protein-cancer associations that are independent of established risk factors, with potential implications for precision medicine.