Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
In large-scale machine learning, the number of data examples is usually very large. To search for the optimal solution, one often uses stochastic gradient methods which only require one (or a small batch of) random example(s) per iteration in order to form an estimator of the full gradient. While full-gradient based methods can enjoy an accelerated (and optimal) convergence rate if Nesterov's momentum trick is used [35-37], theory for stochastic gradient methods are generally lagging behind and less is known for their acceleration. At a high level, momentum is dangerous if stochastic gradients are present. If some gradient estimator is very inaccurate, then adding it to the momentum and moving further in this direction (for every future iteration) may hurt the convergence performance. In other words, when naively equipped with momentum, stochastic gradient methods are "very prune to error accumulation" [25] and do not yield accelerated convergence rates in general.
May-2-2017