Studio INTERNIST: Artificial INTelligent consultant; a pilot study of multiscore Evaluation of GPT-4 Responses to complex cliNical cases submitted by IndividualS with differenT medical knowledge background

Ballabio, Massimo <1999>

dc.contributor.advisor	Zoppoli, Gabriele <1981>
dc.contributor.author	Ballabio, Massimo <1999>
dc.contributor.other	Nicolò Gilardi
dc.date.accessioned	2024-07-04T14:20:06Z
dc.date.available	2025-07-04
dc.date.issued	2024-07-01
dc.identifier.uri	https://unire.unige.it/handle/123456789/8522
dc.description.abstract	Background: In recent years, Large Language Models (LLMs), based on complex artificial neural networks trained to generate human-like dialogue, have been increasingly studied in the medical field. While Artificial Intelligence (AI) models capable of interpreting radiological images are widely used, LLMs could find a relevant application in Internal Medicine (IM), especially for tackling complex clinical cases. Aim: This study examines how the medical education of an individual who submits a clinical query to ChatGPT-4 influences the AI's responses, assessing its potential to support clinical decision-making. Methods: The study involved three groups of participants with different backgrounds: third-year medical students, third-year IM residents, and IM specialists with 5-15 years of experience. Participants were presented with five complex clinical cases, selected from “NEJM Case Challenges” and preformatted with GPT Builders, from which they created their own summary to submit to ChatGPT-4, treated as a consulting colleague. The AI was asked to provide the five most probable diagnoses. The AI outputs were evaluated by three experienced internists using a scoring system developed for this purpose. Results: Residents achieved a significantly higher score than third-year medical students (OR 2.33, CI 1.27-4.28, P = 0.007). A similar trend was observed for specialists, although statistical significance was not reached (OR 1.42, CI 0.77-2.61, P = 0.258). Conclusions: The pilot study rejected the null hypothesis that there is no difference in the use of GPT-4 for diagnostic assistance based on level of education. In particular, it suggests that residents derive the greatest benefit from the AI in terms of diagnostic performance.	it_IT
dc.description.abstract	Background: In recent years, Large Language Models (LLMs), based on complex artificial neural networks pre-trained to produce human-like dialogue, have been increasingly explored in the medical field. While AI models that interpret radiological images are widely used for diagnosis, LLMs have yet to find a relevant application in Internal Medicine (IM), especially for addressing challenging and complex clinical cases. Aim: The present work examines how the medical education of individuals presenting a clinical query to ChatGPT-4 affects the accuracy and relevance of the AI’s responses, with the aim of assessing its potential to support clinical decision-making. Methods: The study involved three groups of participants with different educational backgrounds: third-year medical students, third-year IM residents, and IM specialists with 5 to 15 years of consultancy experience. Participants were presented with five complex clinical cases sourced from the NEJM Case Challenges and standardized with a custom GPT Builder. They wrote their own summary of each case and presented it to ChatGPT-4 as if discussing it with a colleague for clinical insights. ChatGPT-4 was asked to provide the five most probable diagnoses based on the information supplied. The AI outputs were evaluated by three blinded, experienced internists using a scoring system developed specifically for this purpose. Results: Residents achieved a significantly higher score than third-year medical students (OR 2.33, CI 1.27-4.28, P = 0.007). A similar trend was observed for attending physicians, though it did not reach statistical significance (OR 1.42, CI 0.77-2.61, P = 0.258). Conclusions: The pilot study rejected the null hypothesis that there is no difference in the use of GPT-4 for diagnostic assistance based on the level of education. Specifically, it appears that residents derive the greatest benefit in terms of diagnostic ranking.	en_UK
dc.language.iso	en
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.title	Studio INTERNIST: Artificial INTelligent consultant; a pilot study of multiscore Evaluation of GPT-4 Responses to complex cliNical cases submitted by IndividualS with differenT medical knowledge background	it_IT
dc.title.alternative	The INTERNIST study: Artificial INTelligent consultant; a pilot study of multiscore Evaluation of GPT-4 Responses to complex cliNical cases submitted by IndividualS with differenT medical knowledge background	en_UK
dc.type	info:eu-repo/semantics/masterThesis
dc.publisher.name	Università degli studi di Genova
dc.date.academicyear	2023/2024
dc.description.corsolaurea	8745 - MEDICINA E CHIRURGIA
dc.description.area	6 - MEDICINA E CHIRURGIA
dc.description.department	100007 - DIPARTIMENTO DI MEDICINA INTERNA E SPECIALITÀ MEDICHE
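
The odds ratios, confidence intervals and P values quoted in the abstract can be related to one another with a short calculation. The sketch below is only an illustrative consistency check, not the thesis's analysis: it assumes the intervals are 95% Wald intervals on the log-odds scale (the abstract does not state the confidence level or the model used), and the function name `wald_p_from_or_ci` is hypothetical.

```python
import math


def wald_p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Two-sided p-value implied by an odds ratio and its Wald interval.

    Assumption: the interval is log(OR) +/- z_crit * SE, with z_crit
    set for a 95% interval. The abstract does not confirm this.
    """
    # Recover the standard error of log(OR) from the interval width.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    # Wald z statistic for the null hypothesis OR = 1.
    z = math.log(odds_ratio) / se
    # Two-sided standard normal tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Figures as reported in the abstract (reference group: third-year students).
p_residents = wald_p_from_or_ci(2.33, 1.27, 4.28)
p_specialists = wald_p_from_or_ci(1.42, 0.77, 2.61)

print(f"residents:   p ~ {p_residents:.3f}")
print(f"specialists: p ~ {p_specialists:.3f}")
```

Under these assumptions the implied values fall close to the reported P values; small differences are expected because the published figures are rounded.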

Files in this item

Name: tesi28913612.pdf
Size: 1.897Mb
Format: PDF

This item appears in the following Collection(s)

Laurea Magistrale [6938]
