On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks
Adaptive stochastic gradient descent methods, such as AdaGrad, Adam, AdaDelta, Nadam, AMSGrad, \textit{etc.}, have proven effective for non-convex stochastic optimization problems such as training deep neural networks. However, their convergence rates have not been established in the non-convex stochastic setting, apart from recent breakthrough results on AdaGrad \cite{ward2018adagrad} and perturbed AdaGrad \cite{li2018convergence}. In this paper, we propose two new adaptive stochastic gradient methods, AdaHB and AdaNAG, which integrate coordinate-wise AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum, respectively. We jointly characterize the $\mathcal{O}(\frac{\log{T}}{\sqrt{T}})$ non-asymptotic convergence rates of AdaHB and AdaNAG in the non-convex stochastic setting by leveraging a newly developed unified formulation of these two momentum mechanisms. In particular, when the momentum term vanishes, we obtain the convergence rate of coordinate-wise AdaGrad in the non-convex stochastic setting as a byproduct.
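To make the idea of combining coordinate-wise AdaGrad with momentum concrete, here is a minimal Python sketch. It is not the paper's AdaHB or AdaNAG update, which the abstract does not spell out. It assumes the textbook forms: AdaGrad scales each coordinate's step by the inverse square root of its accumulated squared gradients, heavy ball adds a velocity term, and the Nesterov variant uses the common "look-ahead on the velocity" reformulation. The function name `adagrad_momentum_step` and the parameters `lr`, `beta`, `eps`, and `nesterov` are illustrative choices, not names from the paper.

```python
import math


def adagrad_momentum_step(x, m, v, grad, lr=0.1, beta=0.9, eps=1e-8,
                          nesterov=False):
    """One coordinate-wise AdaGrad step with optional momentum (sketch).

    x    : list of parameters
    m    : list of velocities (momentum buffer), same length as x
    v    : list of accumulated squared gradients, same length as x
    grad : list of (stochastic) gradient coordinates at x
    Returns the updated (x, m, v) as new lists.
    """
    new_x, new_m, new_v = [], [], []
    for xi, mi, vi, gi in zip(x, m, v, grad):
        # AdaGrad: accumulate squared gradients per coordinate.
        vi = vi + gi * gi
        # Per-coordinate adaptive step (eps is an assumed stabilizer).
        step = lr * gi / (math.sqrt(vi) + eps)
        # Heavy-ball velocity: decayed previous velocity minus the new step.
        mi = beta * mi - step
        if nesterov:
            # Assumed Nesterov-style reformulation: look ahead along velocity.
            xi = xi + beta * mi - step
        else:
            # Heavy ball: move along the velocity.
            xi = xi + mi
        new_x.append(xi)
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def quadratic(x):
    """Toy objective f(x) = sum of x_i^2, used only for illustration."""
    return sum(xi * xi for xi in x)


def quadratic_grad(x):
    """Exact gradient of the toy objective."""
    return [2.0 * xi for xi in x]
```

Setting `beta=0` makes both variants collapse to plain coordinate-wise AdaGrad, mirroring the abstract's remark about the vanishing momentum term. A short loop that calls `adagrad_momentum_step` with `quadratic_grad(x)` as the gradient is enough to try it on the toy objective.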
Aug-10-2018