How To Make the Gradients Small Stochastically
In contrast, the rate of convergence for the gradients, that is, the number of iterations T needed to find a point x with ‖ f(x)‖ ε, is "addressed very rarely" and sometimes requires new algorithmic ideas [18]. In particular, in the full-gradient setting, accelerated gradient descent is only suboptimal for this new goal (at least in the worst case), and additional tricks are needed to get the fastest convergence rate [18]. We review these tricks in Section 1.1. Unfortunately, in the stochastic setting, to the best of our knowledge, tight bounds are not known for finding points with small gradients.
Jan-8-2018