On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization
Zhou, Dongruo, Tang, Yiqi, Yang, Ziyan, Cao, Yuan, Gu, Quanquan
Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants are widely used to train deep neural networks. Among these variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can outperform vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on that feature's partial gradients, which accelerates convergence. However, AdaGrad was later found to perform worse, especially when the loss function is nonconvex or the gradients are dense, because its learning rate decays rapidly.
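The per-coordinate scaling described above can be illustrated with a minimal sketch. It assumes the standard AdaGrad form, in which each coordinate's step is divided by the square root of its accumulated squared gradients. The function names, the step size `lr`, the stabilizer `eps`, and the toy gradient pattern are all hypothetical choices for illustration and are not taken from the paper.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place (standard form, assumed here)."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared partial gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum

def effective_rates(accum, lr=0.1, eps=1e-8):
    """Per-coordinate learning rate implied by the accumulated gradients."""
    return [lr / (math.sqrt(a) + eps) for a in accum]

# Hypothetical toy gradients: coordinate 0 receives a gradient at every step
# (dense), coordinate 1 only at every tenth step (sparse).
params = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(100):
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    adagrad_step(params, grads, accum)

rates = effective_rates(accum)
print(rates)  # the dense coordinate ends with the smaller effective rate
```

In this toy run the dense coordinate accumulates squared gradients ten times faster than the sparse one, so its effective learning rate shrinks more quickly. This is the rapid decay that the passage identifies as a problem when gradients are dense.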
Aug-16-2018