Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks
Despite the extensive empirical success of deep networks, their optimization and generalization properties are still not well understood. Recently, the neural tangent kernel (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limiting kernel which stays constant during training; on the other hand, when the width is large enough, the function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of overparameterized networks trained by gradient descent, using properties of the NTK. In fact, parameters related to the NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and generalization (Arora et al., 2019). However, in addition to such NTK-dependent parameters, prior work also requires the width to depend polynomially on n, 1/δ, or 1/ε, where n denotes the size of the training set, δ denotes the failure probability, and ε denotes the target error. These large widths far exceed what is used empirically, constituting a significant gap between theory and practice.
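The convergence of the NTK to a limiting kernel can be illustrated numerically. The sketch below is not taken from the paper; it assumes a common two-layer parameterization, f(x) = m^(-1/2) Σ_s a_s ReLU(⟨w_s, x⟩), with fixed signs a_s ∈ {-1, +1}, standard Gaussian hidden weights w_s, and only the hidden layer trained. Under those assumptions the empirical NTK entry is ⟨x, y⟩ times the fraction of hidden units active on both inputs, and its infinite-width limit is ⟨x, y⟩ (π - θ) / (2π), where θ is the angle between x and y. The function names are hypothetical.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sample_weights(m, d, rng):
    """Draw m hidden weight vectors in R^d with i.i.d. N(0, 1) entries (assumed init)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def empirical_ntk(x, y, weights):
    """Empirical NTK entry for the assumed two-layer ReLU network.

    The gradient of f with respect to w_s is a_s * 1[<w_s, x> > 0] * x / sqrt(m),
    and a_s^2 = 1, so the kernel is <x, y> times the fraction of units
    that are active on both x and y.
    """
    both_active = sum(1 for w in weights if dot(w, x) > 0 and dot(w, y) > 0)
    return dot(x, y) * both_active / len(weights)


def limiting_ntk(x, y):
    """Infinite-width limit: <x, y> * (pi - theta) / (2 * pi) for Gaussian weights."""
    cos_theta = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    return dot(x, y) * (math.pi - theta) / (2.0 * math.pi)


if __name__ == "__main__":
    rng = random.Random(0)
    x = (1.0, 0.0)
    y = (math.cos(math.pi / 3), math.sin(math.pi / 3))  # 60 degrees from x
    print("limit:", limiting_ntk(x, y))
    for m in (10, 100, 1000, 10000):
        print(m, empirical_ntk(x, y, sample_weights(m, 2, rng)))
```

Running the script prints the empirical kernel value at several widths next to the limiting value; the gap shrinks roughly like m^(-1/2) at initialization, which is the kind of width dependence the abstract contrasts with what is used empirically.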
Sep-26-2019