In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.
The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
We review recent observations on the dynamical systems induced by gradient descent methods used for training deep networks and summarize properties of the solutions they converge to. Recent results illuminate the absence of overfitting in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we discuss the case of nonlinear DNNs near zero minima of the empirical loss, under exponential-type and square losses, for several variations of the basic gradient descent algorithm, including a new NMGD (norm minimizing gradient descent) version that converges to the minimum norm fixed points of the gradient descent iteration. Our main results are: 1) gradient descent algorithms with weight normalization constraint achieve generalization; 2) the fundamental reason for the effectiveness of existing weight normalization and batch normalization techniques is that they are approximate implementations of maximizing the margin under unit norm constraint; 3) without unit norm constraints some level of generalization can still be obtained for not-too-deep networks because the balance of the weights across different layers, if present at initialization, is maintained by the gradient flow. In the perspective of these theoretical results, we discuss experimental evidence around the apparent absence of overfitting, that is the observation that the expected classification error does not get worse when increasing the number of parameters. Our explanation focuses on the implicit normalization enforced by algorithms such as batch normalization. In particular, the control of the norm of the weights is related to Halpern iterations for minimum norm solutions.
In classification problems (i.e. when the labels Y are discrete), the scaling of the test error with respect to n is determined by characterizations of the VC-dimension /Rademacher complexity  of the function class, which in the worst case increases with its number of parameters. In regression (i.e. when the labels Y are continuous), the mean-squared error of the ordinary least-squares estimate is characterized by the condition number of the regression matrix, which is reasonable for appropriate ratios of d/n but tends to increase astronomically as d approaches n. The qualitative fear is the same: if the function class is too complex, it starts to overfit noise and can generalize poorly to unseen test data. But there is a gap between "can" and "will" -- and indeed this conventional wisdom has been challenged by the recent advent of deeper and deeper neural networks. In particular, a thought-provoking paper  noted that several deep neural networks generalize well despite achieving zero or close to zero training error, and being so expressive that they even have the ability to fit pure noise. As they put it, "understanding deep learning requires rethinking generalization". How can we reconcile the fact that good interpolative solutions exist with the classical bias-variance tradeoff? These phenomena are being actively investigated in a statistical sense [5,6] and a computational sense [7-9] in classification problems and/or noiseless models.
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for ``overfitted'' / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. Moreover, the nearest neighbor schemes exhibit optimal rates under some standard statistical assumptions. Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.