On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks

Aug-10-2018–arXiv.org Machine Learning

Adaptive stochastic gradient descent methods, such as AdaGrad, Adam, AdaDelta, Nadam, and AMSGrad, have proven effective for non-convex stochastic optimization problems such as training deep neural networks. However, their convergence rates in the non-convex stochastic setting have not been established, except for recent breakthrough results on AdaGrad \cite{ward2018adagrad} and perturbed AdaGrad \cite{li2018convergence}. In this paper, we propose two new adaptive stochastic gradient methods, AdaHB and AdaNAG, which integrate coordinate-wise AdaGrad with heavy-ball momentum and Nesterov accelerated gradient momentum, respectively. We jointly characterize the $\mathcal{O}(\frac{\log{T}}{\sqrt{T}})$ non-asymptotic convergence rates of AdaHB and AdaNAG in the non-convex stochastic setting by leveraging a newly developed unified formulation of the two momentum mechanisms. In particular, when the momentum term vanishes, we obtain the convergence rate of coordinate-wise AdaGrad in the non-convex stochastic setting as a byproduct.
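To make the idea of combining coordinate-wise AdaGrad with heavy-ball momentum concrete, the following is a minimal sketch in plain Python. It is not the paper's AdaHB algorithm: the function name `adagrad_heavy_ball`, the parameters `lr`, `beta`, and `eps`, and the toy quadratic objective are all assumptions chosen for illustration, and the exact update rule and step-size schedule analyzed in the paper may differ.

```python
import math

def adagrad_heavy_ball(grad, x0, lr=0.5, beta=0.9, eps=1e-8, steps=200):
    """Coordinate-wise AdaGrad step plus a heavy-ball momentum term.

    Illustrative only (assumed form): the exact AdaHB update and
    constants in the paper may differ from this sketch.
    """
    x = list(x0)
    prev = list(x0)          # previous iterate, used by the momentum term
    accum = [0.0] * len(x)   # per-coordinate running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        new_x = []
        for i in range(len(x)):
            accum[i] += g[i] * g[i]
            # AdaGrad: each coordinate gets its own effective step size
            step = lr * g[i] / (math.sqrt(accum[i]) + eps)
            # heavy ball: gradient step plus beta * (x_t - x_{t-1})
            new_x.append(x[i] - step + beta * (x[i] - prev[i]))
        prev, x = x, new_x
    return x

def quadratic_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)
    return [x[0], 10.0 * x[1]]

x_final = adagrad_heavy_ball(quadratic_grad, [1.0, -1.0])
```

Setting `beta=0.0` removes the momentum term and reduces the sketch to plain coordinate-wise AdaGrad, which mirrors the abstract's remark that the AdaGrad rate follows when the momentum term vanishes.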

artificial intelligence, deep learning, machine learning

Country:
- Europe > Russia (0.04)
- North America > United States
  - New York > Suffolk County > Stony Brook (0.04)
- Asia
  - Russia (0.04)
  - China > Guangdong Province
    - Shenzhen (0.04)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks > Deep Learning (0.86)
  - Statistical Learning > Gradient Descent (0.81)
