Studio INTERNIST: Artificial INTelligent consultant; a pilot study of multiscore Evaluation of GPT-4 Responses to complex cliNical cases submitted by IndividualS with differenT medical knowledge background

Ballabio, Massimo <1999>

Date

2024-07-01

Date available

2025-07-04

Abstract

Background: In recent years, Large Language Models (LLMs), based on complex artificial neural networks pre-trained to produce human-like dialogue, have been increasingly studied in the medical field. While Artificial Intelligence (AI) models that interpret radiological images are widely used for diagnosis, LLMs have yet to find a relevant application in Internal Medicine (IM), where they could be especially useful for addressing complex clinical cases. Aim: This study examines how the medical education of an individual who submits a clinical query to ChatGPT-4 affects the accuracy and relevance of the AI's responses, in order to assess its potential to support clinical decision-making. Methods: The study involved three groups of participants with different educational backgrounds: third-year medical students, third-year IM residents, and IM specialists with 5 to 15 years of experience. Participants were presented with five complex clinical cases selected from the NEJM Case Challenges and preformatted with a custom GPT Builder. For each case they wrote their own summary and submitted it to ChatGPT-4 as if consulting a colleague. ChatGPT-4 was asked to provide the five most probable diagnoses based on the information supplied. The AI outputs were evaluated by three blinded, experienced internists using a scoring system developed specifically for this purpose. Results: Residents achieved a significantly higher score than third-year medical students (OR 2.33, CI 1.27-4.28, P = 0.007). A similar trend was observed for specialists, although it did not reach statistical significance (OR 1.42, CI 0.77-2.61, P = 0.258). Conclusions: The pilot study rejected the null hypothesis that the use of GPT-4 for diagnostic assistance does not differ by level of medical education. Specifically, residents appear to derive the greatest benefit from the AI in terms of diagnostic performance.
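The reported odds ratios, confidence intervals, and P values can be checked for internal consistency with a short calculation. The sketch below is not the thesis's analysis: the abstract does not state the confidence level or the statistical model, so it assumes 95% Wald intervals that are symmetric on the log-odds scale and a two-sided z-test. The function name `wald_p_from_or` is hypothetical.

```python
import math

def wald_p_from_or(or_point, ci_low, ci_high, z_crit=1.959964):
    """Recover an approximate z statistic and two-sided P value from an
    odds ratio and its confidence interval.

    Assumption (not stated in the abstract): the interval is a 95% Wald
    interval, i.e. log(OR) +/- z_crit * SE on the log-odds scale.
    """
    # Standard error of log(OR), from the width of the log-scale interval.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    # Wald z statistic for the null hypothesis OR = 1 (log(OR) = 0).
    z = math.log(or_point) / se
    # Two-sided P value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Values as reported in the abstract (reference group: third-year students).
reported = {
    "residents vs students": (2.33, 1.27, 4.28),
    "specialists vs students": (1.42, 0.77, 2.61),
}

for label, (or_point, ci_low, ci_high) in reported.items():
    z, p = wald_p_from_or(or_point, ci_low, ci_high)
    print(f"{label}: z = {z:.2f}, two-sided P = {p:.3f}")
```

Under these assumptions the recovered P values land close to the reported 0.007 and 0.258. Small differences are expected because the published odds ratios and interval bounds are rounded to two decimals.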

Type

info:eu-repo/semantics/masterThesis