Parallel Flexible Clustering (Flexible Clustering Parallelo)

Pastorino, Edoardo <1999>

Author

Pastorino, Edoardo <1999>

Date

2023-12-13

Date available

2023-12-21

Abstract

This thesis presents the parallelization of FISHDBC, a state-of-the-art clustering algorithm. The parallelization was achieved by improving the construction of the algorithm's main data structures and components: the HNSW, a graph-based data structure used for approximate nearest neighbor search; the MST, a tree that spans all the vertices of a graph while minimizing the total edge weight; and HDBSCAN clustering, designed to cluster data points robustly according to their density. My contribution is a lock-free, shared-memory parallel implementation, which is feasible because FISHDBC provides an approximate solution, and which delivers good performance. Notably, the Parallel Flexible Clustering algorithm is written entirely in Python, with no dependencies on other languages. This makes it easy to use and highly customizable, since user-defined distance metrics, which compute the similarity between data items, are mostly written in Python.
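
As an illustration of the MST component and of a user-defined Python distance metric mentioned in the abstract, here is a minimal sequential sketch. It is not the thesis's implementation: it uses Kruskal's algorithm, enumerates all pairs of points instead of obtaining candidate neighbors from an HNSW, and does not attempt any parallelism. The names `jaccard_distance` and `kruskal_mst` and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Hypothetical user-defined metric: Jaccard distance between two sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def kruskal_mst(n, edges):
    """Return the minimum spanning tree edges of a graph with n vertices.

    edges is an iterable of (weight, u, v) tuples.
    """
    parent = list(range(n))  # union-find forest, one root per vertex

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components, so it cannot form a cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst


# Sample data (assumption): items are sets, compared with the metric above.
points = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {7, 8}, {8, 9}]

# Exhaustive candidate edges, for simplicity only.
edges = [
    (jaccard_distance(points[i], points[j]), i, j)
    for i, j in combinations(range(len(points)), 2)
]

mst = kruskal_mst(len(points), edges)
for weight, u, v in mst:
    print(f"{u} - {v}: {weight:.3f}")
```

Because the metric is an ordinary Python function, swapping in a different notion of similarity only requires replacing `jaccard_distance`; this is the kind of customizability the abstract refers to.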

Type

info:eu-repo/semantics/masterThesis