Show simple item record

dc.contributor.advisor: Villa, Silvia <1977>
dc.contributor.advisor: Molinari, Cesare <1990>
dc.contributor.author: Fresu, Luca <2000>
dc.date.accessioned: 2026-04-02T14:26:36Z
dc.date.available: 2026-04-02T14:26:36Z
dc.date.issued: 2026-03-25
dc.identifier.uri: https://unire.unige.it/handle/123456789/15606
dc.description.abstract [it_IT]: The thesis is structured as follows. Chapter 1 is devoted to discrete-time Markov chains, where we introduce their fundamental properties, including state classification, recurrence and transience, invariant measures, and asymptotic behavior. Chapter 2 presents the foundations of supervised learning within the framework of statistical learning theory. We introduce empirical risk minimization and reproducing kernel Hilbert spaces (RKHS). We derive the closed-form solution of Kernel Ridge Regression. Chapter 3 formulates text generation as a Markov chain problem, where tokens from a vocabulary are mapped to states and the sequence of words in a corpus is modeled as a discrete-time Markov chain. The transition matrix is estimated via maximum likelihood from empirical transition counts, and its statistical validity is supported by the ergodic theorem, which ensures consistency of the estimator. Higher-order language models are implemented by lifting the state space so that a k-order chain becomes a first-order chain on tuples of tokens. Chapter 4 combines these two perspectives by applying Kernel Ridge Regression to next-word prediction in natural language. We model language as a first-order Markov chain and show how the ergodic theorem resolves the violation of the i.i.d. assumption, enabling rigorous generalization guarantees. We present a complete Python implementation using the TinyStories dataset, including:
- Word embedding via pre-trained Word2Vec models
- Kernel matrix computation with Linear and Gaussian kernels
- Training via gradient descent with early stopping
- Quantitative evaluation and text generation experiments
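The abstract names two concrete computations: maximum-likelihood estimation of a Markov transition matrix from empirical bigram counts (Chapter 3) and the closed-form solution of Kernel Ridge Regression (Chapter 2). A minimal Python sketch of both follows, on a toy corpus; the function names and the n·λ regularization convention are illustrative assumptions, not taken from the thesis itself:

```python
import numpy as np

def mle_transition_matrix(tokens, vocab):
    """MLE of a first-order Markov transition matrix from bigram counts."""
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(tokens, tokens[1:]):
        counts[idx[a], idx[b]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed outgoing transition fall back to uniform.
    uniform = np.full_like(counts, 1.0 / len(vocab))
    return np.divide(counts, row_sums, out=uniform, where=row_sums > 0)

def krr_coefficients(K, y, lam):
    """Closed-form Kernel Ridge Regression: alpha = (K + n*lam*I)^(-1) y."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

tokens = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(tokens))
P = mle_transition_matrix(tokens, vocab)
print(P.sum(axis=1))  # each row is a probability distribution
```

Sampling the estimated chain (drawing the next token from the row of `P` indexed by the current token) is then the whole text-generation loop of Chapter 3; the k-order lift described in the abstract amounts to running the same estimator on tuples of k consecutive tokens as states.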
dc.language.iso: en
dc.rights: info:eu-repo/semantics/openAccess
dc.title [it_IT]: Una prospettiva Matematica su NLP
dc.title.alternative [en_UK]: A Mathematical Perspective on NLP
dc.type: info:eu-repo/semantics/masterThesis
dc.subject.miur: MAT/06 - PROBABILITÀ E STATISTICA MATEMATICA
dc.publisher.name: Università degli studi di Genova
dc.date.academicyear: 2024/2025
dc.description.corsolaurea: 9011 - MATEMATICA
dc.description.area: 7 - SCIENZE MAT.FIS.NAT.
dc.description.department: 100021 - DIPARTIMENTO DI MATEMATICA

