Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Sep-26-2019–arXiv.org Machine Learning

Despite the extensive empirical success of deep networks, their optimization and generalization properties are still not well understood. Recently, the neural tangent kernel (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limiting kernel which stays constant during training; on the other hand, when the width is large enough, the function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of overparameterized networks trained by gradient descent, using properties of the NTK. In fact, parameters related to the NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and generalization (Arora et al., 2019). However, in addition to such NTK-dependent parameters, prior work also requires the width to depend polynomially on n, 1/δ or 1/ε, where n denotes the size of the training set, δ denotes the failure probability, and ε denotes the target error. These large widths far exceed what is used empirically, constituting a significant gap between theory and practice.

Country:
- North America > United States
  - Illinois > Champaign County
    - Urbana (0.04)
  - California > Santa Clara County
    - Palo Alto (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)
