Generative agents with reasoning capabilities and a voice interface: implementation and experimental analysis on edge devices

Firpo, Pietro <2001>

View/Open

tesi35233854.pdf (12.24Mb)

Author

Firpo, Pietro <2001>

Date

2025-10-15

Available from

2025-10-23

Abstract

In recent years, the spread of large Language Models (LMs) has enabled the development of agentic systems that interact in natural language and act in a goal-oriented way. However, most current implementations run in the cloud, overlooking the advantages of edge execution: greater resilience to network outages, reduced load on communication infrastructure, and improved energy efficiency. This work investigates the feasibility of running a voice-based LM agent locally on edge devices. An end-to-end pipeline was built that integrates three main components: a speech-to-text module (Moonshine), an LM for language processing (Granite 3.3), and a text-to-speech system (Piper). The prototype was evaluated with a framework of metrics for embedded deployment covering memory usage, performance, energy consumption, and efficiency. Two use cases, a smart thermostat and a smart washing machine, were analyzed as representative Agentic Edge AI scenarios. Tests were conducted on three hardware platforms that differ in memory, computing power, integrated accelerators, and power consumption: Nvidia Jetson AGX Orin, Raspberry Pi 5, and STM32MP257F-EV1. The results show that the voice modules operate in real time on the Jetson and the Raspberry Pi, while LM inference remains the main latency bottleneck, especially compared with typical human dialogue times, highlighting the need for significant acceleration across all agent components to achieve fluid interaction. Overall, this work provides the first quantitative and verifiable characterization of an end-to-end Agentic AI system with a voice interface executed entirely on the edge, offering an original contribution and a concrete reference for the co-design of hardware and software architectures in future Agentic Edge AI systems.
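The three-stage pipeline and the per-stage latency comparison described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: `run_turn`, `TurnResult`, and `real_time_factor` are hypothetical names, and the `stt`, `lm`, and `tts` arguments are placeholder callables standing in for Moonshine, Granite 3.3, and Piper, whose actual APIs are not shown here. The real-time factor (processing time divided by audio duration) is one common way to express "real time" and is assumed here rather than taken from the thesis.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnResult:
    """Output of one dialogue turn plus the time spent in each stage."""
    reply_audio: bytes
    latency_s: Dict[str, float] = field(default_factory=dict)

    @property
    def total_s(self) -> float:
        # End-to-end latency is the sum of the stage latencies.
        return sum(self.latency_s.values())

    def bottleneck(self) -> str:
        # Name of the stage that took the longest.
        return max(self.latency_s, key=self.latency_s.get)


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],   # placeholder for the speech-to-text stage
    lm: Callable[[str], str],      # placeholder for the language-model stage
    tts: Callable[[str], bytes],   # placeholder for the text-to-speech stage
    clock: Callable[[], float] = time.perf_counter,
) -> TurnResult:
    """Run audio -> text -> reply text -> reply audio, timing each stage."""
    latency: Dict[str, float] = {}
    value = audio
    for name, stage in (("stt", stt), ("lm", lm), ("tts", tts)):
        start = clock()
        value = stage(value)
        latency[name] = clock() - start
    return TurnResult(reply_audio=value, latency_s=latency)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s
```

Passing three stub functions to `run_turn` and calling `bottleneck()` on the result returns the name of the slowest stage, which is the kind of per-stage comparison the abstract reports when it identifies LM inference as the dominant source of latency.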

Type

info:eu-repo/semantics/masterThesis