Named Entity Recognition for document anonymization and clustering.

Sacco, Giulia <2000>

dc.contributor.advisor	Villa, Silvia <1977>
dc.contributor.advisor	Molinari, Cesare <1990>
dc.contributor.author	Sacco, Giulia <2000>
dc.contributor.other	Curzio Basso
dc.date.accessioned	2024-12-19T15:18:01Z
dc.date.available	2024-12-19T15:18:01Z
dc.date.issued	2024-12-11
dc.identifier.uri	https://unire.unige.it/handle/123456789/10666
dc.description.abstract	Questa tesi esplora l'analisi automatica di documenti utilizzando tecniche di Natural Language Processing (NLP) e Machine Learning. L'applicazione si concentra su un dataset di curricula vitae. Uno degli obiettivi principali è l'estrazione di informazioni attraverso il Named Entity Recognition (NER), una tecnica che identifica e classifica entità denominate come persone, organizzazioni, luoghi, date ed espressioni numeriche. Un aspetto cruciale è stato l'anonimizzazione dei dati sensibili, quali nomi, indirizzi e codici fiscali, per garantire la privacy dei candidati e promuovere una selezione più equa, minimizzando pregiudizi nei processi di reclutamento. In aggiunta, la tesi si focalizza sul clustering dei curricula tramite l'algoritmo k-means, al fine di ottenere una descrizione significativa dei profili professionali che compongono il dataset. In questa fase il NER è stato utilizzato per estrarre entità come professioni e ragioni sociali che sono state integrate nel processo di clustering per migliorare l'accuratezza delle categorizzazioni. La tesi è articolata in quattro capitoli: il primo fornisce una base teorica sul supervised learning e sul clustering k-means; il secondo discute le tecniche di rappresentazione del testo, dalle tradizionali rappresentazioni sparse alle moderne tecniche di embedding; il terzo approfondisce i metodi per il NER, dai modelli basati su regole ai più avanzati approcci machine learning; infine, il quarto presenta i risultati delle tecniche di anonimizzazione e clustering ottenuti sul dataset di curricula.	it_IT
dc.description.abstract	This thesis explores the automatic analysis of documents using Natural Language Processing (NLP) and Machine Learning techniques. The application focuses on a dataset of CVs. One of the main objectives is the extraction of information using Named Entity Recognition (NER), a technique that identifies and classifies named entities such as people, organisations, places, dates and numerical expressions. A crucial aspect was the anonymisation of sensitive data, such as names, addresses and tax codes, to protect the privacy of candidates and promote fairer selection by minimising bias in recruitment processes. In addition, the thesis focuses on clustering the CVs with the k-means algorithm in order to obtain a meaningful description of the professional profiles that make up the dataset. At this stage, NER was used to extract entities such as occupations and company names, which were integrated into the clustering process to improve the accuracy of the categorisations. The thesis is divided into four chapters: the first provides a theoretical basis on supervised learning and k-means clustering; the second discusses text representation techniques, from traditional sparse representations to modern embedding techniques; the third covers methods for NER, from rule-based models to more advanced machine learning approaches; finally, the fourth presents the results of the anonymisation and clustering techniques obtained on the CV dataset.	en_UK
dc.language.iso	it
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Named Entity Recognition per anonimizzazione e clustering di documenti.	it_IT
dc.title.alternative	Named Entity Recognition for document anonymization and clustering.	en_UK
dc.type	info:eu-repo/semantics/masterThesis
dc.subject.miur	MAT/06 - PROBABILITÀ E STATISTICA MATEMATICA
dc.publisher.name	Università degli studi di Genova
dc.date.academicyear	2023/2024
dc.description.corsolaurea	9011 - MATEMATICA
dc.description.area	7 - SCIENZE MAT.FIS.NAT.
dc.description.department	100021 - DIPARTIMENTO DI MATEMATICA
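
As an illustration of the rule-based end of the NER methods mentioned in the abstract, the following minimal sketch masks Italian tax codes in free text. It is not taken from the thesis: the function name `anonymize`, the placeholder `[TAX_CODE]`, and the sample string are hypothetical, and the pattern assumes the standard 16-character tax code layout (6 letters, 2 digits, 1 letter, 2 digits, 1 letter, 3 digits, 1 letter), ignoring rarer variants.

```python
import re

# Assumption: standard 16-character Italian tax code (codice fiscale) layout.
# Variants in which digits are replaced by letters are not handled here.
TAX_CODE = re.compile(
    r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b",
    re.IGNORECASE,
)


def anonymize(text: str) -> str:
    """Replace every substring matching the tax code pattern with a placeholder."""
    return TAX_CODE.sub("[TAX_CODE]", text)


# Hypothetical CV fragment, for illustration only.
sample = "Mario Rossi, tax code RSSMRA85T10A562S"
print(anonymize(sample))
```

A fixed pattern like this only covers entities with a rigid format. Names and addresses have no such format, which is presumably why the abstract also refers to machine learning approaches to NER.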

Files in this item

Name: tesi31560634.pdf
Size: 903.3 KB
Format: PDF

This item appears in the following collections

Laurea Magistrale [7007]