
IARC 60th Anniversary - 19-21 May 2026

Session : 19/05/26 - Posters

Comparative Performance of Large Language Models in Common Cancer Management Based on a Guideline-Based Standardized Questionnaire

YAN Y. ¹, LI S. ¹, HE Z. ¹, WU D. ¹, ZHANG Y. ¹, DAI M. ¹, CHEN H. ¹

¹ CAMS and PUMC: Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China

Background: Large language models (LLMs) are increasingly used for cancer-related information support, yet their performance in guideline-based cancer management within the Chinese clinical context remains unclear.
Objective: To systematically evaluate and compare the performance of multiple widely used LLMs in answering guideline-derived cancer management questions in China.
Methods: We evaluated nine widely deployed LLMs (Claude 4, DeepSeek-R1/V3, GPT-4 Turbo, GPT-5, Gemini 2.5, Grok 4, Kimi K2, and Qwen3) using 108 standardized open-ended questions derived from national guidelines for seven common cancers. Each model answered all questions under two prompting conditions (guideline-anchored vs. unguided). Accuracy and completeness were scored using an LLM-jury framework to generate composite quality scores. Five automated readability indices were computed to quantify textual complexity. Practical implementation metrics, including response time and token-based generation cost, were also recorded and analyzed.
Results: Grok 4 achieved the highest composite quality under guideline-anchored prompting (median 5.00 [IQR 0.33]), whereas GPT-5 performed best under unguided prompting (median 4.33 [IQR 1.50]). Guideline-anchored prompting markedly improved Grok 4’s quality (Δ = 1.01, 95% CI 0.82–1.20; Cliff’s δ = 0.69) but showed minimal effect on other models. Readability varied substantially, with Kimi K2 (win rates 0.88–0.90) and GPT-4 Turbo (win rates 0.71–0.74) generating the most readable outputs. Response times differed markedly, ranging from 4.81 to 47.84 seconds. Cost-performance analysis showed that Grok 4 and GPT-5 delivered top-tier quality but at substantially higher cost, whereas the DeepSeek models provided mid- to high-quality performance at markedly lower cost.
Conclusions: General-purpose LLMs can generate guideline-concordant cancer management recommendations in the Chinese clinical context, but exhibit wide heterogeneity in quality, readability, latency, and cost. Model-specific validation, domain-adaptation, and scenario-aligned operational control are essential before deploying LLMs in real-world cancer screening and patient-communication workflows.
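
Note on the effect size: the Cliff’s δ reported in the Results is a non-parametric measure of how often scores in one condition exceed scores in the other. The Python sketch below shows the standard unpaired definition, δ = P(x > y) − P(x < y) over all cross-group pairs. It is only an illustration: the function name and the score lists are hypothetical, invented for this example, and are not study data; the abstract does not state which implementation or pairing scheme the authors used.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    ties contribute to neither count.
    """
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical composite quality scores on a 1-5 scale (illustration only,
# NOT taken from the study).
anchored = [5.0, 4.67, 5.0, 4.33, 5.0]
unguided = [4.0, 3.67, 4.33, 3.33, 5.0]

delta = cliffs_delta(anchored, unguided)
print(f"Cliff's delta = {delta:.2f}")
```

A value near 0 means the two conditions overlap almost completely, while values approaching ±1 mean one condition scores higher in nearly every pairwise comparison.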

Figure 1