How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Allen-Zhu, Zeyuan

Neural Information Processing Systems 

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm at most $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$. This is no slower than the best known stochastic version of Newton's method in all parameter regimes.
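To make the setting concrete, here is a minimal sketch of the baseline the abstract contrasts against: plain SGD with the standard $O(1/\sqrt{t})$ step size on a convex stochastic least-squares objective, reporting the gradient norm at the final iterate. This is not the paper's SGD3 algorithm; the problem instance, step-size schedule, and constants below are illustrative assumptions only.

```python
import numpy as np

# Baseline illustration (not SGD3): plain SGD on the convex stochastic objective
#   f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2,
# monitoring how small the gradient norm is at the returned point.

rng = np.random.default_rng(0)
n, d = 1000, 20                       # number of samples and dimension (illustrative)
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def stochastic_grad(x, i):
    """Gradient of the single component 0.5 * (a_i . x - b_i)^2."""
    a_i = A[i]
    return (a_i @ x - b_i) * a_i

def full_grad(x):
    """Full gradient of f, used only to monitor the gradient norm."""
    return A.T @ (A @ x - b) / n

x = np.zeros(d)
T = 5000                              # number of stochastic steps (illustrative)
for t in range(1, T + 1):
    i = rng.integers(n)
    eta = 1.0 / np.sqrt(t)            # standard O(1/sqrt(t)) SGD step size
    x -= eta * stochastic_grad(x, i)

print("gradient norm at final iterate:", np.linalg.norm(full_grad(x)))
```

Under this baseline, the objective value converges at the optimal rate, while the gradient norm of the returned point shrinks more slowly; closing that gap to the near-optimal $\tilde{O}(\varepsilon^{-2})$ complexity is what SGD3 is designed for.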