Una prospettiva Matematica su NLP

View/Open

tesi37472902.pdf (1.066Mb)

Author

Fresu, Luca <2000>

Date

2026-03-25

Available from

2026-04-02

Abstract

The thesis is structured as follows. Chapter 1 is devoted to discrete-time Markov chains and introduces their fundamental properties, including state classification, recurrence and transience, invariant measures, and asymptotic behavior. Chapter 2 presents the foundations of supervised learning within the framework of statistical learning theory. We introduce empirical risk minimization and reproducing kernel Hilbert spaces (RKHS), and derive the closed-form solution of Kernel Ridge Regression. Chapter 3 formulates text generation as a Markov chain problem: tokens from a vocabulary are mapped to states, and the sequence of words in a corpus is modeled as a discrete-time Markov chain. The transition matrix is estimated via maximum likelihood from empirical transition counts, and its statistical validity is supported by the ergodic theorem, which ensures consistency of the estimator. Higher-order language models are implemented by lifting the state space, so that a chain of order k becomes a first-order chain on k-tuples of tokens. Chapter 4 combines these two perspectives by applying Kernel Ridge Regression to next-word prediction in natural language. We model language as a first-order Markov chain and show how the ergodic theorem resolves the violation of the i.i.d. assumption, enabling rigorous generalization guarantees. We present a complete Python implementation using the TinyStories dataset, including:
- Word embedding via pre-trained Word2Vec models
- Kernel matrix computation with Linear and Gaussian kernels
- Training via gradient descent with early stopping
- Quantitative evaluation and text generation experiments
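The maximum-likelihood estimate of the transition matrix and the state-space lifting described for Chapter 3 can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `lift` and `estimate_transitions` and the toy corpus are assumptions made for illustration. Each transition probability is estimated as the count of the observed transition divided by the number of times the source state is left.

```python
from collections import Counter, defaultdict


def lift(tokens, k):
    """Map a token sequence to overlapping k-tuples.

    An order-k chain on tokens becomes a first-order chain on these tuples.
    (Hypothetical helper, not the thesis code.)
    """
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def estimate_transitions(states):
    """Maximum-likelihood estimate: P(s -> t) = count(s, t) / count(s, any)."""
    counts = defaultdict(Counter)
    for s, t in zip(states, states[1:]):
        counts[s][t] += 1
    return {
        s: {t: c / sum(row.values()) for t, c in row.items()}
        for s, row in counts.items()
    }


# Toy corpus, assumed for illustration only.
corpus = "the cat sat on the mat and the cat ran".split()

P1 = estimate_transitions(corpus)           # first-order chain on tokens
P2 = estimate_transitions(lift(corpus, 2))  # second-order chain, lifted to pairs
```

Here `P1["the"]` assigns probability 2/3 to `"cat"` and 1/3 to `"mat"`, since "the" is followed by "cat" twice and by "mat" once. Only observed transitions are stored, so unseen transitions implicitly have probability zero; a practical model would likely add smoothing, which this sketch omits.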

Type

info:eu-repo/semantics/masterThesis