Generalization in Deep Networks: The Role of Distance from Initialization
Nagarajan, Vaishnavh, Kolter, J. Zico
Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that depends on a given random initialization of the network, and not just on the training algorithm and the data distribution. We provide empirical evidence that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the $\ell_2$ distance from the initialization. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
Jan-13-2019
- Country:
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
  - North America > United States
    - Pennsylvania > Allegheny County > Pittsburgh (0.14)
- Genre:
  - Research Report (0.64)
- Technology: