More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

Simon, James B., Karkada, Dhruva, Ghosh, Nikhil, Belkin, Mikhail

arXiv.org Machine Learning 

In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models. It is an empirical fact that more is better in modern machine learning. State-of-the-art models are commonly trained with as many parameters and for as many iterations as compute budgets allow, often with little regularization. This ethos of enormous, underregularized models contrasts sharply with the received wisdom of classical statistics, which suggests small, parsimonious models and strong regularization to make training and test losses similar. The development of new theoretical results consistent with the success of overparameterized, underregularized modern machine learning has been a central goal of the field for some years. How might such theoretical results look? Consider the well-tested observation that wider networks virtually always achieve better performance, so long as they are properly tuned (Kaplan et al., 2020; Hoffmann et al., 2022; Yang et al., 2022). Such a result would do much to bring deep learning theory up to date with practice. In this work, we take a first step towards this general result by proving it in the special case of RF regression -- that is, for shallow networks with only the second layer trained. Our Theorem 1 states that, for RF regression, more features (as well as more data) is better, and thus infinite width is best.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found