D3.429 - Performance Analysis of Five Large Language Models, Including GPTo1-Preview, On an Allergy and Clinical Immunology Exam

Poster abstract

Background

Advances in artificial intelligence have enhanced the potential of large language models (LLMs) in medical education and assessment. While existing studies have primarily focused on general medical examinations such as the United States Medical Licensing Examination (USMLE), this study is the first to evaluate the performance of LLMs in a specialized field: allergy and clinical immunology.

Method

In this comparative, cross-sectional study, the performance of five LLMs (ChatGPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Llama 3.1 405B, and GPTo1-preview) and of 58 expert physician candidates was evaluated on the National Allergy and Clinical Immunology Board Examination. Each participant answered 100 multiple-choice questions presented in Turkish. The questions were classified by medical topic (e.g., allergic diseases, immunology, therapeutic and diagnostic approaches) and by cognitive level according to Bloom’s Taxonomy.

Results

GPTo1-preview demonstrated the highest performance, with an accuracy of 90%, significantly outperforming both the other LLMs and the human participants (p < 0.01). Accuracy for the other LLMs was 81% for Claude 3.5 Sonnet, 76% for ChatGPT-4o, 70% for Llama 3.1 405B, and 68% for Gemini 1.5 Pro; the human participants averaged 56%. In the Bloom’s Taxonomy analysis, all LLMs except GPTo1-preview performed worst at the “Application” level. By topic, “Allergic Diseases” was the lowest-scoring category for all LLMs.

Conclusion

GPTo1-preview outperformed both the other LLMs and the human experts, indicating the significant potential of AI in medicine. However, given the limitations of LLMs and the continued importance of human expertise, AI should serve as a supportive tool in the medical field, and further research is needed to clarify both its capabilities and its limitations.