IARC 60th Anniversary - 19-21 May 2026
Session: 19/05/26 - Posters
Evaluating Large Language Models for Summarizing Radiology Reports into Personalized Patient Letters in a Lung Cancer Screening Program
WARKENTIN M. 1, MULLIN M. 2, BRENNER D. 1, TREMBLAY A. 2
1 Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Canada; 2 Department of Medicine, Cumming School of Medicine, University of Calgary, Calgary, Canada
Background: Clearly communicating key screening findings to patients and primary care providers is an important component of an effective lung cancer screening program. However, personalizing letters for each patient requires additional human and healthcare resources. Text summarization is a strength of large language models (LLMs) and may help with this task.
Objectives: To comprehensively evaluate the ability of several commercial and open-weight LLMs to summarize structured radiology reports into personalized patient letters in a lung cancer screening program.
Methods: Synthetic radiology reports and patient results letters were generated by shuffling components of synoptic radiology reports and letters from participants in the Alberta Lung Cancer Screening Study (ALCSS). We used the Inspect AI framework, developed by the UK AI Security Institute, for LLM evaluations. We followed an "LLM-as-judge" approach, in which one LLM is the solver (i.e., generates the personalized patient letter) and an LLM (the same or a different one) is the scorer (i.e., evaluates how well the solver performed the summarization task). As solvers, we evaluated three commercial (ChatGPT 4.1, Claude Sonnet 4, Gemini 2.5 Pro) and three open-weight (GPT OSS, Qwen3, and LLaMA 3.3) LLMs. Each solver was prompted to generate a patient letter from a radiology report and a simple set of instructions. ChatGPT 4.1 served as the scorer in all evaluations. The scorer applied a 5-item rubric (each item scored 1 to 6 points) covering (1) clinical accuracy, (2) coverage of key findings, (3) readability and comprehension, (4) tone and empathy, and (5) length and focus. Generated patient letters required a score of 90% (27 out of 30) or higher for task completion, or 24 to 26 out of 30 for partial completion. Grades are reported for each solver.
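To illustrate this solver/scorer setup, the following is a minimal Inspect AI sketch in Python. The dataset file name, system prompt, and rubric wording are hypothetical placeholders, not the study's actual materials; only the overall structure (a generating solver graded by ChatGPT 4.1 against the 5-item, 30-point rubric with complete/partial thresholds) reflects the design described above.

# Minimal Inspect AI sketch of the solver/scorer evaluation.
# "synthetic_reports.jsonl", the prompt text, and RUBRIC are illustrative
# placeholders, not the study's actual dataset or instructions.
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate, system_message

RUBRIC = (
    "Score the letter on five criteria, 1 to 6 points each (30 max): "
    "(1) clinical accuracy, (2) coverage of key findings, "
    "(3) readability and comprehension, (4) tone and empathy, "
    "(5) length and focus. End your reply with GRADE: C if the total "
    "is 27-30, GRADE: P if 24-26, otherwise GRADE: I."
)

@task
def patient_letter_task():
    return Task(
        # each sample's input is one synthetic radiology report
        dataset=json_dataset("synthetic_reports.jsonl"),
        solver=[
            system_message(
                "Summarize the following lung cancer screening radiology "
                "report into a clear, empathetic letter addressed to the patient."
            ),
            generate(),  # the solver LLM drafts the letter
        ],
        # ChatGPT 4.1 grades every letter against the 5-item rubric
        scorer=model_graded_qa(
            instructions=RUBRIC,
            partial_credit=True,  # enables the P (partial completion) grade
            model="openai/gpt-4.1",
        ),
    )

A task like this would be run once per solver, e.g. `inspect eval patient_letter.py --model openai/gpt-4.1`, swapping the --model flag for each commercial or open-weight model under evaluation.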
Results: Across 2,332 synthetic reports, ChatGPT 4.1, Claude Sonnet 4, Gemini 2.5 Pro, GPT OSS, Qwen3, and LLaMA 3.3 achieved grades of 95%, 94%, 88%, 90%, 91%, and 93%, respectively, on the task of generating patient letters from radiology reports. The commercial models averaged 92.3%, compared with 91.3% for the open-weight models. ChatGPT 4.1, Claude Sonnet 4, and Gemini 2.5 Pro cost $2.03, $4.32, and $2.01 USD per 1,000 patient letters. The Groq API was used to run the summarization tasks with the open-weight models; GPT OSS, Qwen3, and LLaMA 3.3 cost $0.29, $0.47, and $0.44 USD per 1,000 patient letters. On average, the commercial models cost $0.0028 USD per generated patient letter ($2.79 USD per 1,000 letters) and the open-weight models cost $0.0004 USD ($0.40 USD per 1,000 letters).
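The reported averages follow directly from the three per-model costs above; a short Python check (using only the figures quoted in this abstract):

# Reproduce the average-cost figures reported above (USD per 1,000 letters).
commercial = {"ChatGPT 4.1": 2.03, "Claude Sonnet 4": 4.32, "Gemini 2.5 Pro": 2.01}
open_weight = {"GPT OSS": 0.29, "Qwen3": 0.47, "LLaMA 3.3": 0.44}

for label, costs in [("commercial", commercial), ("open-weight", open_weight)]:
    per_1000 = sum(costs.values()) / len(costs)
    # prints: commercial: $2.79 per 1,000 ($0.0028 per letter)
    #         open-weight: $0.40 per 1,000 ($0.0004 per letter)
    print(f"{label}: ${per_1000:.2f} per 1,000 (${per_1000 / 1000:.4f} per letter)")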
Conclusions: Overall, ChatGPT 4.1 and Claude Sonnet 4 achieved the best performance at summarizing radiology reports into patient-friendly letters. LLMs may provide an affordable and scalable solution for personalized patient letters within a lung cancer screening program. Next, we plan to assess patient preferences by providing paired sets of synoptic and LLM-generated letters to patient-partners for review. LLMs can generate letters in most languages and at an appropriate reading level, helping ensure fair and equitable access to these patient-friendly summaries.