Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Ji, Ziwei, Telgarsky, Matus

arXiv.org Machine Learning 

Despite the extensive empirical success of deep networks, their optimization and generalization properties are still not well understood. Recently, the neural tangent kernel (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limiting kernel which stays constant during training; on the other hand, when the width is large enough, the function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of overparameterized networks trained by gradient descent, using properties of the NTK. In fact, parameters related to the NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and generalization (Arora et al., 2019). However, in addition to such NTK-dependent parameters, prior work also requires the width to depend polynomially on n, 1/δ or 1/ε, where n denotes the size of the training set, δ denotes the failure probability, and ε denotes the target error. These large widths far exceed what is used empirically, constituting a significant gap between theory and practice.
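To make the "converges to a limiting kernel" statement concrete, the following is a minimal sketch under stated assumptions, not the paper's construction. It assumes a two-layer ReLU network f(x) = (1/√m) Σ_s a_s·relu(⟨w_s, x⟩) with fixed signs a_s ∈ {−1, +1}, standard Gaussian w_s, and only the hidden weights trained; this excerpt does not specify the architecture. Under those assumptions the finite-width kernel entry is ⟨x, y⟩ times the fraction of hidden units active on both inputs, and its infinite-width limit is ⟨x, y⟩·(π − θ)/(2π), where θ is the angle between x and y. The function names are hypothetical, and the sketch only illustrates convergence at initialization, not constancy during training.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sample_weights(m, d, rng):
    """Draw m hidden-unit weight vectors with i.i.d. N(0, 1) entries (assumed init)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def empirical_ntk(x, y, weights):
    """Finite-width kernel entry with respect to the hidden-layer weights.

    With output signs of magnitude 1, the gradient inner product reduces to
    <x, y> times the fraction of units whose ReLU is active on both x and y.
    """
    both_active = sum(1 for w in weights if dot(w, x) > 0 and dot(w, y) > 0)
    return dot(x, y) * both_active / len(weights)


def limiting_ntk(x, y):
    """Infinite-width limit: <x, y> * (pi - theta) / (2 * pi)."""
    cos_theta = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    return dot(x, y) * (math.pi - theta) / (2.0 * math.pi)


def kernel_gap(m, x, y, seed=0):
    """Absolute difference between the width-m kernel entry and its limit."""
    rng = random.Random(seed)
    weights = sample_weights(m, len(x), rng)
    return abs(empirical_ntk(x, y, weights) - limiting_ntk(x, y))


if __name__ == "__main__":
    x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    for m in (10, 100, 1000, 10000):
        print(m, kernel_gap(m, x, y))
```

For a single random draw the gap need not shrink monotonically, but its typical size decays on the order of 1/√m, which is the sense in which wider networks track the limiting kernel more closely.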
