On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Zhou, Dongruo, Tang, Yiqi, Yang, Ziyan, Cao, Yuan, Gu, Quanquan

Aug-16-2018–arXiv.org Machine Learning

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants are widely used to train deep neural networks. Among those variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can outperform vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on that feature's partial gradients, which accelerates convergence. However, AdaGrad was later found to perform worse, especially when the loss function is nonconvex or the gradients are dense, because its learning rate decays rapidly.
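The per-coordinate scaling described in the abstract can be illustrated with a minimal sketch of the standard diagonal AdaGrad update. This is not the algorithm analyzed in the paper. The function names `adagrad_step` and `effective_rates`, the step size `lr=0.1`, the stabilizer `eps=1e-8`, and the toy gradient stream are all assumptions made for illustration.

```python
import math


def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal AdaGrad update on plain lists of floats.

    accum holds the running sum of squared gradients per coordinate.
    lr and eps are illustrative defaults, not values from the paper.
    """
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_x = [
        xi - lr * g / (math.sqrt(a) + eps)
        for xi, g, a in zip(x, grad, new_accum)
    ]
    return new_x, new_accum


def effective_rates(accum, lr=0.1, eps=1e-8):
    """Per-coordinate step size implied by the accumulated squared gradients."""
    return [lr / (math.sqrt(a) + eps) for a in accum]


# Toy gradient stream (hypothetical): coordinate 0 receives a gradient on
# every step (dense), coordinate 1 only on every 10th step (sparse).
x = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(1, 101):
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    x, accum = adagrad_step(x, grad, accum)

rates = effective_rates(accum)
print("accumulated squared gradients:", accum)
print("effective learning rates:", rates)
```

In this toy run the dense coordinate accumulates squared gradients ten times faster than the sparse one, so its effective step size shrinks further. That keeps a larger step for rarely updated features, and it is also the rapid learning-rate decay the abstract points to when gradients are dense.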



