Collaborating Authors

 Dana, Léo


Convergence of Shallow ReLU Networks on Weakly Interacting Data

arXiv.org Machine Learning

Understanding the properties of models used in machine learning is crucial for providing guarantees to downstream users. Of particular importance, the convergence of the training process under gradient methods stands as one of the first issues to address in order to comprehend them. While such questions are well understood for linear models and convex optimization problems (Bottou et al., 2018; Bach, 2024), this is not the case for neural networks, which are the most widely used models in large-scale machine learning. This paper focuses on providing quantitative convergence guarantees for a one-hidden-layer neural network. Theoretically, the global convergence analysis of neural networks has seen two main achievements in recent years: (i) the identification of the lazy regime, due to a particular initialization, in which convergence is always guaranteed at the cost of the network behaving essentially as a linear model (Jacot et al., 2018; Arora et al., 2019; Chizat et al., 2019), and (ii) the proof that, with an infinite number of hidden units, a two-layer neural network converges towards the global minimizer of the loss (Mei et al., 2018; Chizat and Bach, 2018; Rotskoff and Vanden-Eijnden, 2018). However, neural networks are trained in practice outside of these regimes: they are known to perform feature learning, and experimentally reach a global minimum with a large but finite number of neurons. Quantifying in which regimes neural networks converge to a global minimum of their loss remains an important open question.
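For context on the object under study, the following is a minimal NumPy sketch (not code from the paper) of gradient descent on a one-hidden-layer ReLU network under the mean-field 1/m output scaling mentioned above; all sizes, the target function, and the step-size rule are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: n points in d dimensions (all sizes are illustrative).
n, d, m = 20, 5, 512                       # samples, input dim, hidden units
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))     # arbitrary smooth target

# Mean-field parameterisation: f(x) = (1/m) * sum_j a_j * relu(<w_j, x>).
# (The lazy/NTK regime corresponds to a different output scaling, under which
#  the features w_j barely move during training.)
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def forward(X):
    H = np.maximum(X @ W.T, 0.0)           # (n, m) hidden ReLU activations
    return H @ a / m, H

lr = 0.05 * m                              # step size scaled with the width
for step in range(2000):
    pred, H = forward(X)
    r = pred - y                           # residuals, shape (n,)
    # Hand-written gradients of the loss 0.5 * mean(r ** 2).
    grad_a = H.T @ r / (n * m)
    grad_W = ((H > 0) * r[:, None] * a[None, :]).T @ X / (n * m)
    a -= lr * grad_a
    W -= lr * grad_W

print("final training loss:", 0.5 * np.mean((forward(X)[0] - y) ** 2))
```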


Memorization in Attention-only Transformers

arXiv.org Artificial Intelligence

Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.
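To fix ideas about the architecture whose capacity is being bounded, here is a minimal NumPy sketch of a single-head, attention-only block (no MLP); it is purely illustrative of the setting, not the memorization construction from the paper, and all shapes and weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, Wq, Wk, Wv, Wo):
    """Single-head attention block with no MLP (attention-only)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T, T) attention weights
    return A @ V @ Wo                             # (T, d) outputs

# Toy setup: T tokens of dimension d (hypothetical sizes).
T, d, dk = 8, 16, 16
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))
Wo = rng.standard_normal((dk, d))

Y = attention_layer(X, Wq, Wk, Wv, Wo)
print(Y.shape)   # (8, 16)
```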