Reviews: The Marginal Value of Adaptive Gradient Methods in Machine Learning

Neural Information Processing Systems 

Adaptive methods rely on metrics that evolve during the optimization process. Unlike gradient descent, Nesterov's method, or the heavy-ball method, this may produce iterates that lie outside the linear span of previously visited points and estimated gradients. These methods have recently become very popular in deep learning. The main question addressed by the authors is how the two categories of methods compare. First, the authors construct a simple classification example for which they prove that adaptive methods behave very badly while non-adaptive methods achieve perfect accuracy. Second, they report extensive numerical comparisons of the different classes of algorithms, showing consistent superiority of non-adaptive methods.
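The linear-span remark can be illustrated with a small sketch. It is not the authors' construction: it assumes a toy least-squares problem with a single data point, and an AdaGrad-style diagonal update standing in for adaptive methods in general. All names (`gradient`, `run_gd`, `run_adagrad`, `off_span`) and the step size, step count, and `eps` values are hypothetical choices made for illustration.

```python
import math

def gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def run_gd(x, y, lr=0.1, steps=200):
    """Plain gradient descent from w = 0 (assumed step size and step count)."""
    w = [0.0] * len(x)
    for _ in range(steps):
        g = gradient(w, x, y)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_adagrad(x, y, lr=0.1, steps=200, eps=1e-8):
    """AdaGrad-style update: each coordinate is scaled by the root of its
    accumulated squared gradients (a diagonal, evolving metric)."""
    w = [0.0] * len(x)
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = gradient(w, x, y)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        w = [wi - lr * gi / (math.sqrt(a) + eps)
             for wi, gi, a in zip(w, g, accum)]
    return w

def off_span(w, x):
    """For 2-D vectors: zero exactly when w is a multiple of x."""
    return abs(w[0] * x[1] - w[1] * x[0])

# Single data point x with target y; every gradient is a multiple of x.
x, y = [1.0, 2.0], 1.0
w_gd = run_gd(x, y)
w_ada = run_adagrad(x, y)
print("gradient descent:", w_gd, "off-span:", off_span(w_gd, x))
print("AdaGrad-style:   ", w_ada, "off-span:", off_span(w_ada, x))
```

In this toy setting every gradient is a multiple of x, so the gradient descent iterate remains a multiple of x. The per-coordinate rescaling in the AdaGrad-style update divides out the magnitude of each coordinate of x, so its iterate moves along the direction of sign(x) instead and leaves that span, even though both runs fit the single data point.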