D3.429 - Performance Analysis of Five Large Language Models, Including GPTo1-Preview, On an Allergy and Clinical Immunology Exam

Poster abstract

Background

Advances in artificial intelligence have enhanced the potential of large language models (LLMs) in medical education and assessment. While existing studies have primarily focused on general medical examinations such as the United States Medical Licensing Examination (USMLE), this study is the first to evaluate the performance of LLMs in a specialized field: allergy and clinical immunology.

Method

In this comparative, cross-sectional study, the performance of five LLMs (ChatGPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Llama 3.1 405B, and GPTo1-preview) and of 58 expert physician candidates was evaluated on the National Allergy and Clinical Immunology Board Examination. Each participant answered 100 multiple-choice questions presented in Turkish. The questions were classified by medical topic (e.g., allergic diseases, immunology, therapeutic and diagnostic approaches) and by cognitive level according to Bloom’s Taxonomy.

Results

GPTo1-preview demonstrated the highest performance, with an accuracy of 90%, significantly outperforming both the other LLMs and the human participants (p < 0.01). Accuracy for the other LLMs was 81% for Claude 3.5 Sonnet, 76% for ChatGPT-4o, 70% for Llama 3.1 405B, and 68% for Gemini 1.5 Pro; the human participants averaged 56%. In the Bloom’s Taxonomy analysis, all LLMs except GPTo1-preview performed worst at the “Application” level. By topic, “Allergic Diseases” was the lowest-scoring category for all LLMs.

Conclusion

GPTo1-preview outperformed both the other LLMs and the human experts, indicating the significant potential of AI in medicine. However, given the limitations of LLMs and the continued importance of human expertise, AI should serve as a supportive tool in the medical field, and further research is needed to clarify both its capabilities and its limitations.