D3.430 - Artificial Intelligence Chatbots Can Influence the Decision-Making and Behavior of Patients with Allergies: Results of a Multicenter Evaluation Study

Poster abstract

Background

Since the introduction of the ChatGPT language model to the general public almost 2.5 years ago, researchers worldwide have observed not only the rapid expansion of artificial intelligence (AI) applications (this model alone is used by over 200 million people) but also the active development of competing models. Website and software developers are increasingly integrating virtual assistant code to facilitate interactions with users, and medical websites are no exception.

The aim of our study was to determine the potential influence of interactions with two of the most popular AI-based models on the decision-making and care pathways of patients with allergies.

Method

Twelve questions were posed to the latest versions of two of the most popular AI-based models (ChatGPT-4o and Gemini 2.0 Flash). Six of these questions were formulated from an analysis of Google Trends in Ukraine over the past year across three thematic categories ("allergies", "allergens", "allergy tests"); the other six were derived from an online survey of doctors regarding the questions patients most commonly ask during consultations. Each response was independently evaluated by five highly specialized allergists on three parameters (accuracy, correctness, and completeness), each scored on a scale from 0 to 3.

The resulting score for each response was the arithmetic sum of its parameter scores; in addition, a cross-sectional statistical analysis was conducted.
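The scoring scheme can be illustrated with a minimal sketch (Python, hypothetical data): each rater's composite score is the sum of the three parameter scores, as stated above; averaging across the five raters and the per-parameter means are shown only as assumptions, since the abstract does not specify how the individual raters' scores were combined.

```python
# Minimal sketch of the scoring scheme, using hypothetical ratings.
# Combining the five raters by simple averaging is an assumption.
from statistics import mean

PARAMETERS = ("accuracy", "correctness", "completeness")  # each rated 0-3

def composite_score(ratings: dict) -> int:
    """Arithmetic sum of the three parameter scores from one rater (range 0-9)."""
    return sum(ratings[p] for p in PARAMETERS)

# Hypothetical ratings from five allergists for a single chatbot response.
raters = [
    {"accuracy": 2, "correctness": 3, "completeness": 2},
    {"accuracy": 2, "correctness": 2, "completeness": 1},
    {"accuracy": 3, "correctness": 2, "completeness": 2},
    {"accuracy": 2, "correctness": 2, "completeness": 2},
    {"accuracy": 1, "correctness": 2, "completeness": 1},
]

per_rater_totals = [composite_score(r) for r in raters]
print("per-rater totals:", per_rater_totals)   # [7, 5, 7, 6, 4]
print("mean total:", mean(per_rater_totals))   # assumed averaging across raters

# Per-parameter means, analogous to the mean accuracy/correctness/completeness
# values reported per model in the Results section.
for p in PARAMETERS:
    print(f"mean {p}:", mean(r[p] for r in raters))
```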

All questions were further divided into two categories: orientation questions (concerning the nature of processes, diagnostic methods, etc.) and behavioral questions (what to do, where to go, care pathways, etc.).


Results

Overall, ChatGPT proved more effective across all three parameters (mean accuracy +17.2%, mean completeness +19.1%, and mean correctness +14.6%; the differences were statistically significant). The mean accuracy score across all questions was 2.1 points for ChatGPT and 1.7 points for Gemini.

The mean correctness score was 2.2 points for ChatGPT and 1.8 points for Gemini, while the mean completeness score was 1.9 and 1.2 points, respectively.

In line with the study's objective, responses to orientation and behavioral questions were compared using mathematical modeling. Responses to behavioral questions scored statistically significantly lower on all three parameters than responses to orientation questions (p<0.05).
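As a purely illustrative sketch of this kind of comparison (the abstract does not name the test actually used), a nonparametric Mann-Whitney U test applied to hypothetical per-question mean scores for the two categories could look as follows:

```python
# Illustrative only: the statistical test and the data below are assumptions,
# not the analysis reported in the study.
from scipy.stats import mannwhitneyu

orientation_scores = [2.4, 2.2, 2.6, 2.0, 2.3, 2.1]  # hypothetical, six orientation questions
behavioral_scores  = [1.6, 1.8, 1.4, 1.9, 1.5, 1.7]  # hypothetical, six behavioral questions

stat, p_value = mannwhitneyu(orientation_scores, behavioral_scores, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant difference
```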

Conclusion

The two most popular artificial intelligence models showed statistically significant differences in the quality of their responses to common queries about allergic diseases. At the same time, the quality of responses to the behavioral question block was significantly lower than that to the orientation question block, which, in our opinion, may negatively influence the care pathways and decision-making of patients with allergies at the pre-medical stage.