Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple descent

Yue Li, Yuting Wei

arXiv.org Machine Learning 

At the core of statistical learning lies the problem of understanding the generalization performance (e.g., out-of-sample errors) of the learning algorithms in use. Conventional wisdom in statistics held that including too many covariates when training statistical models can hurt generalization (despite improving training accuracy), due to undesired over-fitting. This leads to the classical conclusion that proper regularization -- through either adding certain penalty functions to the loss function or algorithmic self-regularization -- seems to be critical in achieving the desired accuracy (e.g., Friedman et al. (2001); Wei et al. (2019)). However, an evolving line of work in machine learning provides empirical evidence suggesting, to the surprise of many statisticians, that over-parameterization is not necessarily harmful. Indeed, many machine learning models (such as random forests or deep neural networks) are trained until the training error vanishes -- meaning that they perfectly interpolate the data -- while still generalizing well (e.g., Zhang et al. (2021); Wyner et al. (2017); Belkin et al. (2019)). A key observation toward explaining this phenomenon is that many models, when trained by gradient-type methods (e.g., gradient descent, stochastic gradient descent, AdaBoost), converge to certain minimum-norm interpolators, which implicitly favor models of smaller complexity. These empirical mysteries have inspired a recent flurry of activity toward understanding the generalization properties of various interpolators.
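To make the object of study concrete, the minimum $\ell_1$-norm interpolator solves $\min \|\beta\|_1$ subject to $X\beta = y$ in the over-parameterized regime ($p > n$). The sketch below (not from the paper; the function name and data are illustrative) computes it via the standard linear-programming reformulation $\beta = u - v$ with $u, v \ge 0$, using SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolator(X, y):
    """Solve min ||b||_1 subject to X b = y.

    Standard LP reformulation: write b = u - v with u, v >= 0,
    then minimize sum(u) + sum(v) subject to X (u - v) = y.
    """
    n, p = X.shape
    c = np.ones(2 * p)                 # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])          # equality constraint X(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v

rng = np.random.default_rng(0)
n, p = 20, 50                          # over-parameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = min_l1_interpolator(X, y)
print("max residual:", np.max(np.abs(X @ beta - y)))  # ~0: exact interpolation
print("nonzero coordinates:", np.sum(np.abs(beta) > 1e-8))
```

Because the solution is a basic feasible point of the LP, it typically has at most $n$ nonzero coordinates, illustrating the implicit preference for low model complexity that the abstract alludes to.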