We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance. Right: Test error, shown for varying train epochs. All models trained using Adam for 4K epochs. The bias-variance tradeoff is a fundamental concept in classical statistical learning theory (e.g., Hastie et al. (2005)). The idea is that models of higher complexity have lower bias but higher variance. According to this theory, once model complexity passes a certain threshold, models "overfit" with the variance term dominating the test error, and hence from this point onward, increasing model complexity will only decrease performance (i.e., increase test error). Hence conventional wisdom in classical statistics is that, once we pass a certain threshold, "larger models are worse. Such networks have millions of parameters, more than enough to fit even random labels (Zhang et al. (2016)), and yet they perform much better on many tasks than smaller models. Indeed, conventional wisdom among practitioners is that "larger models are better' ' (Krizhevsky et al. (2012), Huang et al. (2018), Szegedy et al.
This video explores the recent study on Deep Double Descent! This is an interesting phenomenon of increasing and decreasing test error with respect to scaling up model size, training data, and epochs. The lens of bias-variance tradeoff may lead you to interpret increasing test error as overfitting, however the double descent phenomenon shows remarkable cases of a second descent when continuing to increase model size / epoch count. Awareness of this phenomenon may help you interpret your training curves for hyperparameter optimization and architectures search!
Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in the modern machine learning practice. The bias-variance trade-off implies that a model should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns. However, in the modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered over-fit, and yet they often obtain high accuracy on test data.
Now, what was the Gradient Descent algorithm? Above algorithm says, to perform the GD, we need to calculate the gradient of the cost function J. And to calculate the gradient of the cost function, we need to sum (yellow circle!) the cost of each sample. If we have 3 million samples, we have to loop through 3 million times or use the dot product. If you insist to use GD.
In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.