On the Lipschitz Constant of Deep Networks and Double Descent
Gamba, Matteo, Azizpour, Hossein, Björkman, Mårten
arXiv.org Artificial Intelligence
A longstanding question in understanding the remarkable generalization ability of deep networks is how to characterize the hypothesis class of models trained in practice, thereby isolating properties of the networks' model function that capture generalization (Hanin & Rolnick, 2019; Neyshabur et al., 2015). A central problem is understanding the role played by overparameterization (Arora et al., 2018; Neyshabur et al., 2018; Zhang et al., 2018) - a key design choice of state-of-the-art models - in promoting regularization of the model function. Modern overparameterized networks can achieve good generalization while perfectly interpolating the training set (Nakkiran et al., 2019). This phenomenon is described by the double descent curve of the test error (Belkin et al., 2019; Geiger et al., 2019): as model size increases, the error follows the classical bias-variance trade-off curve (Geman et al., 1992), peaks when a model is large enough to interpolate the training data, and then decreases again as model size grows further.
Nov-14-2023