A high-resolution dynamical view on momentum methods for over-parameterized neural networks
Xin Liu, Wei Tao, Jun Wang, Zhisong Pan
arXiv.org Artificial Intelligence
Owing to their simplicity and efficiency, first-order gradient methods are widely used in training neural networks. Although the optimization problem of a neural network is non-convex, recent research has proved that first-order methods can attain a global minimum when training over-parameterized neural networks, in which the number of parameters is significantly larger than the number of training instances. Momentum methods, including the heavy ball method (HB) and Nesterov's accelerated gradient method (NAG), are the workhorse first-order methods owing to their accelerated convergence. In practice, NAG often performs better than HB, yet existing analyses do not distinguish their convergence behavior in training neural networks. Motivated by this, we provide a convergence analysis of HB and NAG for training an over-parameterized two-layer neural network with ReLU activation, through the lens of high-resolution dynamical systems and neural tangent kernel (NTK) theory. Compared with existing work, our analysis not only establishes tighter upper bounds on the convergence rates of both HB and NAG, but also characterizes the effect of the gradient correction term, which accounts for the acceleration of NAG over HB. Finally, we validate our theoretical results on three benchmark datasets.
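For reference, a standard single-sequence formulation of the two methods makes the gradient correction term explicit; the notation below (step size $s$, momentum parameter $\beta$) is a common convention and not necessarily the one used in the paper:

\[
\text{HB:}\quad x_{k+1} = x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k),
\]
\[
\text{NAG:}\quad x_{k+1} = x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k) - \beta s\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr).
\]

The extra term $-\beta s\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr)$ is the gradient correction; in the high-resolution dynamical-systems view it behaves like Hessian-driven damping, which is the mechanism the abstract credits for NAG's acceleration over HB.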
Oct-27-2022