Diagnostic Accuracy and Reproducibility of ChatGPT-4o for HER2 Immunohistochemistry Scoring in Equivocal Breast Cancer Cases

Charoenlap, Cheep; Arunsawat, Pakorn; Chienwichai, Kittiphan; Prateepchaiboon, Tanaporn; Chang, Arunchai

doi:10.31557/APJCP.2026.27.5.1703

Document Type: Research Articles

Authors

¹ Department of Anatomical Pathology, Hatyai Hospital, Songkhla, Thailand.

² Division of Nephrology, Department of Internal Medicine, Hatyai Hospital, Songkhla, Thailand.

³ Division of Medical Oncology, Department of Internal Medicine, Hatyai Hospital, Songkhla, Thailand.

⁴ Division of Gastroenterology, Department of Internal Medicine, Hatyai Hospital, Songkhla, Thailand.

Abstract

Background: HER2 immunohistochemistry (IHC) plays a central role in therapeutic decision-making for breast cancer. However, interpretation of equivocal (2+) IHC results remains challenging and is subject to interobserver variability, necessitating reflex in situ hybridization testing. This study evaluated the diagnostic performance and reproducibility of ChatGPT-4o, a general-purpose large language model, in scoring HER2 IHC in breast cancer cases initially classified as IHC 2+.

Methods: We retrospectively analyzed 81 formalin-fixed, paraffin-embedded invasive carcinoma of no special type (NST) cases with prior HER2 IHC 2+ scores and corresponding dual in situ hybridization (DISH) results. Five high-power field images per case were independently analyzed by ChatGPT-4o across three sessions, using a standardized prompt aligned with the ASCO/CAP 2023 guidelines. Cases remaining equivocal after AI-assisted interpretation were excluded from diagnostic performance calculations. HER2 DISH served as the reference standard.

Results: Fourteen cases (17.3%) remained equivocal following AI interpretation. Among the 67 reclassified cases, ChatGPT-4o demonstrated an overall diagnostic accuracy of 79% (95% CI: 67–88%), a sensitivity of 30%, specificity of 100%, positive predictive value of 100%, and negative predictive value of 77%. Intra-model reproducibility was good (intraclass correlation coefficient = 0.78), whereas agreement with HER2 DISH was fair (Cohen’s κ = 0.375). Misclassification predominantly involved false-negative interpretations among HER2-positive cases.

Conclusion: ChatGPT-4o demonstrated high specificity and reproducibility for identifying HER2 IHC 3+ cases but showed limited sensitivity and only fair concordance with HER2 DISH. These findings indicate that, in its current general-purpose form, ChatGPT-4o is not suitable for independent HER2 assessment and may serve, at best, as an exploratory adjunct to pathologist interpretation.
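The performance figures in the Results can be related to one another through a 2×2 table of AI call versus DISH status. The abstract does not report the underlying counts, so the sketch below uses assumed counts (6 true positives, 14 false negatives, 0 false positives, 47 true negatives). These were chosen only because they sum to the 67 reclassified cases and reproduce the reported percentages; they are not taken from the article. The function name `diagnostic_metrics` is likewise illustrative.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard 2x2 diagnostic metrics plus Cohen's kappa.

    Rows are the reference standard (DISH), columns the index test (AI call).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Chance agreement: product of the marginal proportions for each class.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "kappa": (accuracy - p_expected) / (1 - p_expected),
    }


# ASSUMED counts, not reported in the abstract: picked to be consistent
# with the published percentages for the 67 reclassified cases.
metrics = diagnostic_metrics(tp=6, fn=14, fp=0, tn=47)

# Share of the 81 cases that stayed equivocal after AI interpretation.
equivocal_fraction = 14 / 81

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"equivocal fraction: {equivocal_fraction:.3f}")
```

Under these assumed counts the function gives a sensitivity of 0.30, specificity and positive predictive value of 1.00, negative predictive value of about 0.77, accuracy of about 0.79, and κ of about 0.375, in line with the reported values. The confidence interval is not reproduced here because the abstract does not state which interval method was used.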

Keywords