
IARC 60th Anniversary - 19-21 May 2026

Session : 19/05/26 - Posters

Fake it til you make it: The Potential of synthetic data and AI in data-driven cancer control

SLOEP M. ¹, CONSOLI S. ², KATSIMPOKIS D. ¹, MELONI A. ³, REFORGIATO RECUPERO D. ³, GELEIJNSE G. ¹

¹ Netherlands Comprehensive Cancer Organisation, Utrecht, Netherlands; ² European Commission, Joint Research Centre (DG JRC), Ispra, Italy; ³ Department of Mathematics and Computer Science, University of Cagliari, Cagliari, Italy

Background
With the rising global cancer incidence, cancer registries are increasingly important to steer and evaluate data-driven cancer control policies. This increasing demand for registry data, including for applications of artificial intelligence (AI), places growing pressure on registries to provide timely access to high-quality data while complying with strict privacy and legal regulations. These constraints can slow innovation, increase workload for registry staff, and limit cross-border collaboration. Synthetic data offer a promising solution by enabling privacy-enhancing methodological development without direct access to sensitive patient data.

Objectives
To evaluate whether synthetic cancer registry data can be effectively used to develop and validate advanced data analyses. In particular, for the application of AI models to survival prediction, we develop models on synthetic data and subsequently validate them on "real" registry data. We also assess the potential benefits of this approach for cancer registries in terms of privacy protection, workload reduction, and support for reproducible, policy-relevant research.

Methods
We developed a survival prediction analysis using large language models (LLMs) trained exclusively on a synthetic breast cancer dataset (60,000 records) generated to emulate the Netherlands Cancer Registry. Multiple LLM architectures (encoder-only, decoder-only, and encoder–decoder) were fine-tuned using a natural-language representation of structured registry variables and compared with established statistical and machine-learning approaches (Cox regression and gradient boosting). Model performance was evaluated using standard survival metrics. The best-performing models were deployed—without retraining—within a cancer registry environment on a population-based cohort of 183,304 patients.
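To illustrate what a natural-language representation of structured registry variables might look like, the following is a minimal Python sketch. The variable names, labels, and sentence template are hypothetical; the abstract does not specify the registry schema or the exact wording used for fine-tuning.

```python
from typing import Mapping

# Hypothetical variable names and labels; the real registry schema
# is not given in the abstract.
FIELD_LABELS = {
    "age_at_diagnosis": "age at diagnosis",
    "tumour_stage": "tumour stage",
    "grade": "differentiation grade",
    "er_status": "oestrogen receptor status",
}


def record_to_text(record: Mapping[str, object]) -> str:
    """Render one structured record as a single sentence-like string."""
    parts = []
    for key, label in FIELD_LABELS.items():
        value = record.get(key)
        # Assumption: missing values are stated explicitly rather than dropped.
        parts.append(f"{label}: {'unknown' if value is None else value}")
    return "Patient with " + "; ".join(parts) + "."


example = {"age_at_diagnosis": 62, "tumour_stage": "II", "grade": 2, "er_status": None}
print(record_to_text(example))
```

A string of this kind could then be passed to a tokenizer as model input, with the survival outcome as the training target.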

Results
LLM-based models trained on synthetic data consistently outperformed classical approaches in survival prediction, achieving higher concordance and discrimination. Importantly, models trained solely on synthetic data generalized well to real-world registry data, with only limited degradation in calibration and strong preservation of discriminative performance. The findings show that well-constructed synthetic data can serve as a dependable foundation for model development and benchmarking, substantially reducing data access delays and the operational risks associated with sharing sensitive patient-level data.
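As a concrete reading of "concordance", the sketch below computes Harrell's concordance index in plain Python. It is an assumption that this particular variant is among the metrics used; the abstract only refers to standard survival metrics. The function and argument names are illustrative.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's C: share of comparable pairs that the risk scores rank correctly.

    times  -- observed follow-up time per patient
    events -- 1 if the event (death) was observed, 0 if censored
    risks  -- predicted risk score; higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Because the metric depends only on the ordering of risk scores, discrimination can be preserved on real data even when calibration degrades.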

Conclusions / Implications
Our findings show that synthetic cancer registry data can enable safe, efficient, and reproducible development of advanced analyses, and highlight the relevance of AI and LLMs to the field of cancer surveillance. By decoupling model development from real patient data, this approach reduces privacy risks and lowers the administrative and operational burden, which is potentially beneficial for registries in low- and middle-income countries (LMICs). At present, applicability is still limited, because the burden is partly shifted to the expertise and resources required to create well-constructed synthetic data. However, the potential of synthetic data–driven workflows is clear: they can accelerate innovation while maintaining scientific validity, supporting more equitable participation in cancer analytics across diverse health system contexts. These benefits align with the WHO guidance on AI in health and underscore the importance of deploying AI and digital innovation in ways that reduce, rather than exacerbate, existing inequities in global cancer surveillance and research. The Global Initiative for Cancer Registry Development may consider including digital and AI capabilities, including synthetic data generation, in its training and support program.