IARC 60th Anniversary - 19-21 May 2026

Session: 21/05/26 - Posters

Can large language models assist with title–abstract screening in cancer evidence synthesis?

PEARCE M. 1, TENG L. 1, XU Z. 2, MARKOZANNES G. 1, MARGARITA C. 1, VIEIRA R. 1, PAPADIMITRIOU N. 1, DAVIES P. 2, MILLARD L. 2, SOBCZYK-BARAD M. 2, HIGGINS J. 2, MARTIN R. 2, GAUNT T. 2, POMBO SELEIRO E. 1, LIU Y. 2, TSILIDIS K. 1, CHAN D. 1

1 Imperial College London, London, United Kingdom; 2 University of Bristol, Bristol, United Kingdom

Background
The World Cancer Research Fund International Global Cancer Update Programme (CUP Global) synthesises evidence on lifestyle factors and cancer risk and survival. Conducting a systematic literature review carries a considerable labour and time cost: the article screening phase at CUP Global often takes two experienced reviewers around one month to screen approximately 20,000 records to identify all relevant publications on a particular topic. This study examines whether a large language model (LLM) can assist title–abstract screening of cancer-related studies.

Objectives
To evaluate the performance, and estimate the time savings, of LLM-assisted title–abstract screening.

Methods
A random sample of 5104 records was drawn from a previously screened, standard PubMed systematic literature search for research related to lifestyle factors and breast cancer incidence. Sampling reflects the 96:4 ratio of excluded to included records observed in previous work. Priority scores indicating relevance to a breast cancer incidence topic were retrieved from PU’ER (https://github.com/automation-in-systematic-reviews/puer). Scores were produced using a model pre-training approach for study screening that fine-tunes the biomedical BlueBERT LLM on pre-existing study screening decisions. The approach leverages a Siamese network to capture the relevance between the review topic and each candidate study. A larger set of 28,973 records was used to select thresholds for classifying articles as included, excluded, or uncertain, using recall and precision calculated at each candidate classification threshold.
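The two-threshold selection step can be sketched as follows. This is a hypothetical illustration, not the PU’ER implementation: the function name, the assumption of untied scores, and the tie-breaking behaviour are all simplifications. It shows how a lower threshold (best precision while retaining 100% recall) and an upper threshold (the most lenient cut-off that still holds 95% precision) might be derived from scored, previously labelled records.

```python
def select_thresholds(scores, labels, target_precision=0.95):
    """Scan candidate thresholds from strictest to most lenient and pick:
      - lower: the highest threshold that still achieves 100% recall
               (and therefore the best precision at full recall);
      - upper: the most lenient threshold whose precision stays at or
               above target_precision.
    Assumes at least one positive label and untied scores."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    tp = fp = 0
    lower = upper = None
    for score, label in pairs:
        tp += label
        fp += 1 - label
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision >= target_precision:
            upper = score  # keeps moving down while precision holds
        if recall == 1.0 and lower is None:
            lower = score  # first (highest) threshold reaching full recall
    return lower, upper
```

Records scoring at or above the upper threshold would be auto-included, those below the lower threshold auto-excluded, and those in between flagged as uncertain for human review.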

Article screening is ongoing at the time of writing, and four screening conditions are being assessed: 1) no LLM assistance (the ‘traditional’ approach); 2) priority score ranking from most to least relevant; 3) priority score plus include/exclude classification using a single threshold at the best precision achieving 100% recall; 4) priority score plus include/uncertain/exclude classification using a lower threshold at the best precision achieving 100% recall and an upper threshold at 95% precision. Four independent reviewers are screening four randomly allocated batches of 1276 records in a cross-over experimental design, with each record–condition pairing seen by exactly one reviewer. Time taken to screen each record is estimated using automated timestamps.
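The abstract does not specify the allocation scheme beyond a cross-over design. One common way to guarantee that every batch is screened exactly once under every condition, with each reviewer seeing each batch and each condition exactly once, is a Latin-square rotation; the sketch below is an illustrative assumption (function name, seeding, and rotation rule are not from the study protocol).

```python
import random

def latin_square_allocation(reviewers, batches, conditions, seed=0):
    """Assign each reviewer one batch per condition via a Latin-square
    rotation, after randomly shuffling the batch order. Every
    (batch, condition) pair is covered by exactly one reviewer."""
    assert len(reviewers) == len(batches) == len(conditions)
    rng = random.Random(seed)
    order = list(range(len(batches)))
    rng.shuffle(order)  # random starting arrangement of batches
    plan = {}
    for i, reviewer in enumerate(reviewers):
        for j, condition in enumerate(conditions):
            # rotate through batches: reviewer i gets batch (i + j) mod n
            batch = batches[order[(i + j) % len(batches)]]
            plan[(reviewer, condition)] = batch
    return plan
```

With four reviewers, four batches of 1276 records, and four conditions, this yields 16 reviewer–condition cells covering all 5104 records under every condition.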

Results
To account for variability in the inherent difficulty of judging each record, as well as batch and reviewer differences, we will compare conditions using a mixed modelling approach. Each new decision will be marked correct or incorrect against the ground truth from previous human screening. This binary outcome will be modelled using a binomial generalised linear mixed model (logit link), with condition as a categorical fixed effect and random effects for record, reviewer, and batch. The probability of a correct decision will be predicted for each record under each condition. The summed probabilities will then be used to calculate expected true positive, true negative, false positive, and false negative counts, followed by recall, precision, accuracy, and F1 score for each condition. Mean decision time per record will be compared in a similar way.
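The step from predicted probabilities to expected performance metrics can be illustrated as follows. This is a sketch of the arithmetic described above, downstream of the fitted mixed model; the function name and toy inputs are illustrative assumptions, not study outputs.

```python
def expected_metrics(p_correct, truth):
    """Convert per-record predicted probabilities of a correct decision
    into expected confusion-matrix counts and summary metrics.

    p_correct: predicted probability each decision is correct
    truth: ground-truth label (True = record should be included)
    """
    # A correct decision on an included record is a true positive;
    # an incorrect one is a false negative (and vice versa for excluded).
    tp = sum(p for p, t in zip(p_correct, truth) if t)
    fn = sum(1 - p for p, t in zip(p_correct, truth) if t)
    tn = sum(p for p, t in zip(p_correct, truth) if not t)
    fp = sum(1 - p for p, t in zip(p_correct, truth) if not t)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(truth)
    f1 = 2 * precision * recall / (precision + recall)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}
```

Because the counts are sums of probabilities rather than hard 0/1 decisions, the resulting recall, precision, accuracy, and F1 are expectations under the fitted model rather than observed values.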

Conclusions/Implications
We provide empirical data on the implementation of LLM assistance in an applied evidence-synthesis setting. This work informs future methodological practice within CUP Global and the wider field of cancer evidence synthesis.