Identifying the Active Speaker in Multiparty Interaction Using
Multimodal Analysis

Saberi, Amirmohammad <1999>

dc.contributor.advisor	Recchiuto, Carmine Tommaso <1984>
dc.contributor.author	Saberi, Amirmohammad <1999>
dc.date.accessioned	2026-01-22T14:12:10Z
dc.date.available	2026-01-22T14:12:10Z
dc.date.issued	2025-12-18
dc.identifier.uri	https://unire.unige.it/handle/123456789/14538
dc.description.abstract	This thesis addresses active speaker identification in natural, fast-paced group interactions, a capability that social robots need in order to perceive and address the right person. Audio-only solutions fail when several people talk at once, when there is background noise, or when the loudest voice is not directed at the robot. To address these issues, the thesis proposes a real-time multimodal approach that integrates audio and visual evidence. The method first estimates speech likelihood from the microphone signal, then measures small mouth movements and head poses from live camera images, and finally combines the two streams with a learned decision model that is refined through reinforcement learning to improve interaction with the robot. The pipeline was implemented in Python on Pepper without sacrificing natural conversational timing. Performance was evaluated on a public audio-visual benchmark and on lab recordings with up to five people. The proposed method outperforms an audio-only baseline in active speaker identification by almost 19 percent. It maintains interactive frame rates even with overlapping speech and moderate visual occlusion. Users reported that robots equipped with the system turn their heads faster and appear more attentive and social. The thesis improves conversational robots by reliably and quickly locating the current speaker in everyday group settings. It also lays groundwork for future research on adaptive gesturing, turn-taking, and multiparty human-robot engagement.	en_UK
dc.language.iso	en
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Identifying the Active Speaker in Multiparty Interaction Using Multimodal Analysis	it_IT
dc.title.alternative	Identifying the Active Speaker in Multiparty Interaction Using Multimodal Analysis	en_UK
dc.type	info:eu-repo/semantics/masterThesis
dc.subject.miur	ING-INF/04 - AUTOMATICA
dc.publisher.name	Università degli studi di Genova
dc.date.academicyear	2024/2025
dc.description.corsolaurea	10635 - ROBOTICS ENGINEERING
dc.description.area	9 - INGEGNERIA
dc.description.department	100023 - DIPARTIMENTO DI INFORMATICA, BIOINGEGNERIA, ROBOTICA E INGEGNERIA DEI SISTEMI
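
The abstract describes a three-step pipeline: estimate speech likelihood from audio, measure mouth movement and head pose per person, then fuse the two streams with a learned decision model. The sketch below illustrates only the general idea of such late fusion. It is not the thesis's implementation: the `Candidate` fields, the logistic form, the weights, the bias, and the threshold are all hypothetical placeholders standing in for a learned model, and the reinforcement-learning refinement is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Per-person visual cues for one frame (hypothetical fields)."""
    person_id: str
    mouth_motion: float  # 0..1, magnitude of recent lip movement
    facing_robot: float  # 0..1, how directly the head faces the camera


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def speaker_score(speech_prob, cand, weights=(2.0, 3.0, 1.5), bias=-3.5):
    """Fuse audio and visual evidence into one score in (0, 1).

    The weights and bias are illustrative constants, not learned values.
    """
    w_audio, w_mouth, w_face = weights
    z = (w_audio * speech_prob
         + w_mouth * cand.mouth_motion
         + w_face * cand.facing_robot
         + bias)
    return sigmoid(z)


def pick_active_speaker(speech_prob, candidates, threshold=0.5):
    """Return the id of the highest-scoring person, or None if nobody
    reaches the (assumed) decision threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: speaker_score(speech_prob, c))
    if speaker_score(speech_prob, best) >= threshold:
        return best.person_id
    return None


if __name__ == "__main__":
    people = [Candidate("left", 0.9, 0.8), Candidate("right", 0.1, 0.9)]
    print(pick_active_speaker(0.9, people))
```

In this toy setup the audio score is shared by everyone in view, so the visual cues decide which person the speech is attributed to. That is one plausible reading of why combining the streams helps when the loudest voice is not the one facing the robot.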

Files in this item

Name: tesi36112903.pdf
Size: 2.640Mb
Format: PDF

This item appears in the following collections

Laurea Magistrale [6938]
