Reviews: Adaptive Methods for Nonconvex Optimization

Neural Information Processing Systems 

Bounds are given for the expected gradient of an ergodic average of the iterates produced by the algorithms when applied to an L-smooth function, and these bounds converge to zero over time. The authors present several numerical results showing that their algorithm achieves state-of-the-art performance on different problems. In addition, they achieve this performance with little tuning, unlike classical SGD. One motivation for their work is a paper [27] showing that a recent adaptive algorithm, ADAM, can fail to converge even on simple convex problems when the batch size is kept fixed.