D1.487 - Evaluation of Large Language Models in Otolaryngology: Accuracy on Structured Question Banks and S.C.O.R.E.-Based Interpretation of Guidelines and Consensus Statements

Poster abstract

Background

Large language models (LLMs) are increasingly used in otolaryngology for medical education, decision support, and guideline interpretation. However, few studies have examined the hallucinations that arise when they are applied in otolaryngology practice. Characterizing these hallucinations is key to guiding the appropriate clinical use of LLMs in otolaryngology.

Method

We evaluated the hallucination patterns and reasoning reliability of ChatGPT, Gemini, and Qwen on otolaryngology-focused multiple-choice questions and on the interpretation of clinical practice guidelines, using the S.C.O.R.E. evaluation framework.

Results

In an open-ended question assessment, the average accuracies of ChatGPT-5, Gemini 2.5 Flash, and Qwen3-Max were 55.2%, 77.7%, and 72.7%, respectively. Questions in the head and neck field had relatively low accuracy rates. In contrast, the qualitative assessment using S.C.O.R.E. found the three LLMs' responses satisfactory. For OTO-HNS clinical practice guideline questions, the average Likert scores were 5 for Safety, 4.5 for Consensus, 5 for Objectivity, 4.8 for Reproducibility, and 4.2 for Explainability.

Conclusion

Gemini 2.5 Flash and Qwen3-Max achieved a passing score on the sample exam, demonstrating the potential to pass the Chinese Otolaryngology–Head and Neck Surgery National Senior Health Professional Qualification Examination. As LLM capabilities advance and their use in otolaryngology grows, thorough assessment and validation of these generative AI models in otolaryngology-specific clinical contexts becomes increasingly critical. Incorporating evaluation frameworks such as S.C.O.R.E. into the assessment process can help ensure that LLM-driven systems are not only robust and accurate but also safe, reliable, and trustworthy when deployed in routine clinical practice, medical decision-making, and patient care.