Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization

Ma, Xuezhe

arXiv.org Machine Learning 

Importantly, the update and storage of the diagonal approximation of Hessian is as efficient as adaptive first-order optimization methods with linear complexity for both time and memory. To handle nonconvexity, we replace the Hessian with its rectified absolute value, which is guaranteed to be positive-definite. Nonconvex stochastic optimization is of core practical importance in many fields of machine learning, in particular for training deep neural networks (DNNs). First-order gradient-based optimization algorithms, conceptually attractive due to their linear efficiency on both the time and memory complexity, have led to tremendous progress and impressive successes. However, one disadvantage of SGD is that the gradients in different directions are scaled uniformly, resulting in limited convergence speed and sensitive choice of the learning rate, and thus has spawned a lot of recent interest in accelerating SGD from the algorithmic and practical perspectives. Recently, many adaptive first-order optimization methods have been proposed to achieve rapid training progress with element-wise scaled learning rates, and we can only mention a few here due to space limits. In their pioneering work, Duchi et al. (2011) proposed AdaGrad, which scales the gradient by the square root of the accumulative square gradients from the first iteration.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found