Guess-number Metrics for Privacy Exposure in Large Language Models (Metriche Guess-number per la perdita di privacy nei Large Language Models)

Haider, Ali <2000>

dc.contributor.advisor	Demetrio, Luca <1993>
dc.contributor.advisor	Dell'Amico, Matteo <1979>
dc.contributor.advisor	Rovetta, Stefano <1966>
dc.contributor.author	Haider, Ali <2000>
dc.date.accessioned	2025-04-03T14:19:34Z
dc.date.available	2025-04-03T14:19:34Z
dc.date.issued	2025-03-27
dc.identifier.uri	https://unire.unige.it/handle/123456789/11816
dc.description.abstract	È noto che i language models sollevano problemi di privacy, poiché i dati confidenziali che possono essere inclusi nei loro dati di addestramento possono essere esposti agli utenti. Carlini et al. hanno proposto una metrica, EXPOSURE, per valutare quanto un determinato dato sensibile diventi prominente a causa della sua inclusione nel training set. Pur essendo utile, Carlini et al. ammettono che tale metrica può essere utilizzata solo come limite inferiore al livello effettivo di perdita di privacy, il che implica che un attaccante potrebbe estrarre un dato sensibile dal language model vittima con meno sforzo rispetto a quanto stimato dalla metrica EXPOSURE. Inoltre, EXPOSURE può essere calcolata soltanto iniettando volontariamente dati sensibili all’interno di un language model e successivamente recuperandoli. Per superare questo problema, proponiamo di utilizzare una metrica differente, ovvero il guess-number ranking, che calcola il numero di tentativi che un attaccante dovrebbe effettuare per trovare un determinato dato sensibile. Il guess ranking può essere calcolato in modo efficiente utilizzando un algoritmo Monte Carlo, ideato per la valutazione della robustezza delle password e recentemente adottato nel contesto degli attacchi side-channel.	it_IT
dc.description.abstract	Language models are known to raise privacy issues, because confidential data included in their training data may be exposed to users. Carlini et al. proposed a metric, EXPOSURE, to evaluate how prominent a given piece of sensitive data becomes due to its inclusion in the training set. While useful, Carlini et al. admit that it can only serve as a lower bound on the actual level of privacy loss, implying that an attacker might extract a targeted piece of data from the victim large language model with less effort than the EXPOSURE metric estimates. Moreover, EXPOSURE can only be computed by deliberately injecting sensitive data into a language model and later retrieving it. To overcome this issue, we propose a different metric, the guess-number ranking, which computes the number of guesses an attacker must make before finding a specific piece of sensitive data. The guess ranking can be computed efficiently using a Monte Carlo algorithm devised for password strength checking and recently adopted in the context of side-channel attacks.	en_UK
dc.language.iso	en
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Metriche Guess-number per la perdita di privacy nei Large Language Models	it_IT
dc.title.alternative	Guess-number Metrics for Privacy Exposure in Large Language Models	en_UK
dc.type	info:eu-repo/semantics/masterThesis
dc.subject.miur	ING-INF/05 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI
dc.publisher.name	Università degli studi di Genova
dc.date.academicyear	2023/2024
dc.description.corsolaurea	10852 - COMPUTER SCIENCE
dc.description.area	7 - SCIENZE MAT.FIS.NAT.
dc.description.department	100023 - DIPARTIMENTO DI INFORMATICA, BIOINGEGNERIA, ROBOTICA E INGEGNERIA DEI SISTEMI
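
The abstract describes the guess number as the count of guesses an attacker makes before reaching a given secret, estimated by Monte Carlo sampling. This record does not spell out the algorithm, so the following is only a minimal sketch of one sampling estimator known from the password-strength literature: draw n samples from the model and, for every sample more probable than the target, add 1 / (n * p(sample)). The toy 4-digit model, the digit weights, and all function names are assumptions made for illustration; they are not taken from the thesis. In practice, the probabilities would come from a language model.

```python
import itertools
import math
import random

DIGITS = "0123456789"
LENGTH = 4

# Hypothetical toy "model": digits are independent and digit d has weight 1/(d+1).
_WEIGHTS = [1.0 / (d + 1) for d in range(10)]
_NORMALISER = sum(_WEIGHTS) ** LENGTH


def prob(secret: str) -> float:
    """Probability the toy model assigns to a 4-digit string."""
    # An integer product keeps ties between permutations exact.
    score = math.prod(int(c) + 1 for c in secret)
    return 1.0 / (score * _NORMALISER)


def sample(rng: random.Random) -> str:
    """Draw one string from the toy model."""
    return "".join(rng.choices(DIGITS, weights=_WEIGHTS, k=LENGTH))


def estimate_guess_number(target: str, n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of how many strings are more probable than target."""
    p_target = prob(target)
    total = 0.0
    for _ in range(n_samples):
        p = prob(sample(rng))
        if p > p_target:
            # Each more-probable sample stands in for 1 / (n * p) strings.
            total += 1.0 / (n_samples * p)
    return total


def exact_guess_number(target: str) -> int:
    """Brute-force count, feasible only because the toy space has 10**4 strings."""
    p_target = prob(target)
    return sum(
        1
        for t in itertools.product(DIGITS, repeat=LENGTH)
        if prob("".join(t)) > p_target
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact:   ", exact_guess_number("1234"))
    print("estimate:", estimate_guess_number("1234", 50_000, rng))
```

The brute-force count is included only to check the estimate on a space small enough to enumerate; the sampling estimator never enumerates the space, which is why it remains usable when the set of candidate secrets is far too large to list.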

Files in this item

Name: tesi32421744.pdf
Size: 1.589 MB
Format: PDF

This item appears in the following Collection(s)

Laurea Magistrale [6129]