Convergence of Shallow ReLU Networks on Weakly Interacting Data

Léo Dana, Francis Bach, Loucas Pillaud-Vivien

arXiv.org Machine Learning 

Understanding the properties of models used in machine learning is crucial for providing guarantees to downstream users. Of particular importance, the convergence of the training process under gradient methods stands as one of the first issues to address in order to comprehend them. While this question is well understood for linear models and convex optimization problems (Bottou et al., 2018; Bach, 2024), this is not the case for neural networks, which are the most widely used models in large-scale machine learning. This paper focuses on providing quantitative convergence guarantees for a one-hidden-layer neural network. Theoretically, the global convergence analysis of neural networks has seen two main achievements in the past years: (i) the identification of the lazy regime, due to a particular initialization, where convergence is always guaranteed at the cost of the network behaving essentially as a linear model (Jacot et al., 2018; Arora et al., 2019; Chizat et al., 2019), and (ii) the proof that, with an infinite number of hidden units, a two-layer neural network converges towards the global minimizer of the loss (Mei et al., 2018; Chizat and Bach, 2018; Rotskoff and Vanden-Eijnden, 2018). However, neural networks are trained in practice outside of these regimes: they are known to perform feature learning, and experimentally reach a global minimum with a large but finite number of neurons. Quantifying in which regimes neural networks converge to a global minimum of their loss remains an important open question.
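For concreteness, the sketch below trains such a one-hidden-layer ReLU network with full-batch gradient descent on a toy regression problem. It is only an illustration of the model class under discussion, not the setting analyzed in the paper; the width, step size, data, and target are illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions, not the paper's setup): a
# one-hidden-layer ReLU network f(x) = (1/m) * sum_j a_j * relu(w_j . x),
# trained with full-batch gradient descent on the squared loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 200                 # samples, input dimension, hidden units
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                  # arbitrary smooth target

W = rng.standard_normal((m, d))      # hidden-layer weights w_j (rows)
a = rng.standard_normal(m)           # output weights a_j

lr = 0.5
for step in range(2000):
    H = np.maximum(X @ W.T, 0.0)     # ReLU activations, shape (n, m)
    pred = H @ a / m                 # network outputs
    resid = pred - y
    loss = 0.5 * np.mean(resid ** 2)

    # Gradients of the squared loss with respect to both layers.
    grad_a = H.T @ resid / (n * m)                                   # (m,)
    mask = (H > 0.0).astype(float)                                   # ReLU (sub)gradient
    grad_W = ((resid[:, None] * mask) * a[None, :]).T @ X / (n * m)  # (m, d)

    a -= lr * grad_a
    W -= lr * grad_W

print(f"training loss after {step + 1} steps: {loss:.4f}")
```

In this toy run the training loss decreases steadily; whether, and how fast, such dynamics reach a global minimum for finite width is precisely the type of question the paper addresses.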