Comparative analysis of large language models for compliance verification of safety-critical documentation in the railway sector

Author

Mengi, Aleyna <1999>

Date

2026-02-18

Date available

2026-02-26

Abstract

Assessing the compliance of safety-critical documentation with international standards is a time-consuming process that requires analysing large volumes of material and is inherently error-prone because it relies on human judgment. With the growing adoption of Large Language Models (LLMs), their use as support for documentation assessment has become increasingly relevant. This thesis evaluates the compliance-assessment performance of three LLM systems selected on the basis of usability, accessibility, and privacy conditions: ChatGPT, Microsoft Copilot, and an open-source model deployed locally via Ollama. ChatGPT and Copilot were chosen for their ease of use, while the Ollama-based solution was implemented on a closed, university-managed server to ensure tighter control over data exposure. The study was conducted in three stages. First, a human-expert control group was created by extracting requirements from CENELEC EN 50126-1, EN 50126-2, and EN 50129. Second, the LLM verification results were validated quantitatively against a reference dataset with three outcome classes (Compliant, Non-compliant, Not found), using confusion-matrix-based metrics and the Matthews correlation coefficient (MCC). Third, Independent Safety Assessment (ISA) reports were generated via Ollama using few-shot prompting based on paired Safety Case and ISA report examples, followed by expert qualitative review. The quantitative findings indicate that ChatGPT achieved the strongest alignment with expert judgments (accuracy 83.82%, MCC 0.664). Microsoft Copilot and the Ollama-based system showed similar overall performance (approximately 60% accuracy, MCC ≈ 0.20), with near-perfect recall but very low specificity, indicating a systematic bias toward compliant classifications and a high number of false positives. Overall, LLMs can reduce expert workload as decision-support tools. However, current limitations in calibration and false-positive control still make
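The pattern reported for Copilot and the Ollama-based system (high recall, low specificity, moderate accuracy, low MCC) can be illustrated with a minimal Python sketch. It assumes the three outcome classes are collapsed to a binary problem with Compliant as the positive class, which the abstract does not state explicitly. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not the thesis data.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute confusion-matrix metrics for a binary classifier.

    Assumption: 'Compliant' is treated as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of truly compliant items labelled compliant.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of truly non-compliant items labelled non-compliant.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (NOT from the thesis): a model that labels almost
# everything as compliant, so most non-compliant items become false positives.
example = binary_metrics(tp=58, fp=36, fn=2, tn=4)

for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With counts like these, accuracy stays moderate because most items are truly compliant, while MCC stays low because it penalises the false positives that accuracy and recall hide. This is why MCC is a useful complement to accuracy when the classes are imbalanced.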

Type

info:eu-repo/semantics/masterThesis