Momentum-Based Variance Reduction in Non-Convex SGD

Oct-10-2024, 19:29:17 GMT–Neural Information Processing Systems

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the best-known rate but without requiring knowledge of $\sigma$.
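The abstract does not spell out the update rule, so the following is only a minimal sketch of a momentum-style variance-reduced step in the spirit of STORM. The recursion for the direction `d`, the adaptive step size, the constants `k`, `w`, `c`, the clipping of the momentum weight to at most 1, and the toy noisy quadratic are all assumptions made for illustration, not details taken from this page. The key idea shown is that each new stochastic gradient is corrected by the change in the gradient between the previous and current iterate, evaluated on the same sample, so no large batch is needed.

```python
import numpy as np


def noisy_quadratic_grad(x, noise, sigma=0.1):
    """Hypothetical stochastic gradient of F(x) = 0.5 * ||x||^2 with additive noise."""
    return x + sigma * noise


def storm(grad_fn, x0, steps, k=0.1, w=0.1, c=100.0, seed=0):
    """Sketch of a STORM-style loop; k, w, c are illustrative constants."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)

    # First step: the direction is a single plain stochastic gradient.
    noise = rng.standard_normal(x.shape)
    g = grad_fn(x, noise)
    d = g.copy()
    sum_sq = float(g @ g)  # running sum of squared gradient norms

    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient norms accumulate.
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d

        # Momentum weight tied to the step size (clipped to 1 as an assumption).
        a = min(1.0, c * eta ** 2)

        # Evaluate the SAME sample at the new and the previous iterate.
        noise = rng.standard_normal(x.shape)
        g = grad_fn(x, noise)
        g_prev = grad_fn(x_prev, noise)

        # Variance-reduced momentum: new gradient plus a corrected old direction.
        d = g + (1.0 - a) * (d - g_prev)
        sum_sq += float(g @ g)

    return x
```

A call such as `storm(noisy_quadratic_grad, np.array([5.0, -3.0]), steps=2000)` should drive the iterate toward the minimizer at the origin. Note that each iteration uses two gradient evaluations on one sample rather than one evaluation on a large batch.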

momentum-based variance reduction, non-convex sgd, stochastic gradient descent

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)