Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
Mingchen Li, Mahdi Soltanolkotabi, Samet Oymak
Deep neural networks (DNNs) are ubiquitous in a growing number of domains ranging from computer vision to healthcare. State-of-the-art DNN models are typically overparameterized and contain more parameters than the size of the training dataset. It is well understood that in this overparameterized regime, DNNs are highly expressive and have the capacity to (over)fit arbitrary training datasets, including pure noise [56]. Mysteriously, however, neural network models trained via simple algorithms such as stochastic gradient descent continue to predict well on yet unseen test data. In such overparameterized scenarios there may be infinitely many globally optimal network parameters consistent with the training data, so the key challenge is to understand which network parameters (stochastic) gradient descent converges to and what their properties are. Indeed, a recent series of papers [16, 52, 56] suggests that solutions found by first-order methods tend to have favorable generalization properties. As DNNs begin to be deployed in safety-critical applications, the need for a foundational understanding of their noise robustness and their unique prediction capabilities intensifies. This paper focuses on an intriguing phenomenon: overparameterized neural networks are surprisingly robust to label noise when first-order methods with early stopping are used to train them [25]. To observe this phenomenon, consider Figure 1, where we perform experiments on the MNIST dataset.
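The setting described in the abstract, gradient descent on an overparameterized model with a fraction of corrupted labels and training halted early, can be illustrated with a small sketch. This is not the paper's MNIST experiment or its network: it assumes a synthetic two-cluster dataset, a fixed random ReLU feature map with only the output weights trained, and a stopping rule based on a held-out (also noisy) validation loss. All names and parameters (`run_early_stopping_demo`, `flip_prob`, `k`, `lr`, and so on) are hypothetical choices made for illustration.

```python
import math
import random


def make_data(n, rng, flip_prob):
    """Two Gaussian clusters in 2D with labels +1/-1; each label is flipped with flip_prob."""
    xs, ys = [], []
    for i in range(n):
        y = 1.0 if i % 2 == 0 else -1.0
        xs.append([1.5 * y + rng.gauss(0.0, 1.0), 1.5 * y + rng.gauss(0.0, 1.0)])
        ys.append(-y if rng.random() < flip_prob else y)
    return xs, ys


def relu_features(xs, weights, biases):
    """Fixed random ReLU features, scaled by 1/sqrt(k) (an assumed stand-in for a network)."""
    scale = 1.0 / math.sqrt(len(weights))
    return [
        [scale * max(0.0, w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(weights, biases)]
        for x in xs
    ]


def predict(phi, v):
    """Linear prediction on top of the feature matrix phi."""
    return [sum(p * vj for p, vj in zip(row, v)) for row in phi]


def mse(preds, ys):
    """Mean squared error between predictions and (possibly noisy) labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def run_early_stopping_demo(num_steps=150, lr=0.1, k=100, n=30, flip_prob=0.2, seed=0):
    """Full-batch gradient descent with k > n parameters, tracking the best validation step."""
    rng = random.Random(seed)
    x_tr, y_tr = make_data(n, rng, flip_prob)
    x_val, y_val = make_data(n, rng, flip_prob)  # validation labels are noisy too
    weights = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(k)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(k)]
    phi_tr = relu_features(x_tr, weights, biases)
    phi_val = relu_features(x_val, weights, biases)

    v = [0.0] * k  # trainable output weights, initialized at zero
    history = []
    best = {"step": 0, "val_loss": float("inf"), "v": list(v)}
    for step in range(1, num_steps + 1):
        resid = [p - y for p, y in zip(predict(phi_tr, v), y_tr)]
        # Gradient of (1 / (2n)) * sum(resid^2) with respect to v.
        grad = [sum(r * row[j] for r, row in zip(resid, phi_tr)) / n for j in range(k)]
        v = [vj - lr * g for vj, g in zip(v, grad)]
        train_loss = mse(predict(phi_tr, v), y_tr)
        val_loss = mse(predict(phi_val, v), y_val)
        history.append((step, train_loss, val_loss))
        # Early stopping: remember the iterate with the lowest validation loss so far.
        if val_loss < best["val_loss"]:
            best = {"step": step, "val_loss": val_loss, "v": list(v)}
    return history, best


if __name__ == "__main__":
    history, best = run_early_stopping_demo()
    print("best step:", best["step"], "validation loss:", best["val_loss"])
    print("final step:", history[-1][0], "validation loss:", history[-1][2])
```

Comparing `best["step"]` with the final step shows whether, for a given seed and noise level, stopping early would have been preferred over training to the end. The validation-based rule is a common practical choice assumed here; it is not necessarily the stopping criterion analyzed in the paper.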
Apr-7-2019
- Country:
- North America > United States > California
- Los Angeles County > Los Angeles (0.28)
- Riverside County > Riverside (0.14)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Health & Medicine (0.34)
- Technology: