Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Tang, Kejie, Liu, Weidong, Zhang, Yichen

arXiv.org Artificial Intelligence 

SGD is a first-order optimization algorithm that approximates the expected loss by averaging the loss function over a mini-batch of training examples. At each iteration, the algorithm updates the model parameters in the direction of the negative gradient of the mini-batch loss, scaled by a learning rate parameter. While SGD is simple and easy to implement, it may converge slowly or oscillate in high-dimensional optimization problems, particularly when the loss function is ill-conditioned or noisy. Momentum-based methods enhance SGD by introducing an exponentially weighted moving average of the past gradients to the update rule, which serves to dampen oscillations and accelerate convergence. In particular, the momentum term introduces a form of inertia to the update process, allowing the algorithm to maintain a more consistent direction of movement even in the presence of noisy gradients. Several variants of momentum-based SGD have been proposed, such as Nesterov's accelerated gradient (NAG), Adagrad, and Adam, each with its own strengths and weaknesses.
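The momentum update described above, combined with the iterate averaging named in the title, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names, the noisy quadratic objective, the step size, and the momentum weight are all hypothetical, the Gaussian noise stands in for mini-batch sampling, and the running mean of the iterates is a plain Polyak-Ruppert-style average chosen for simplicity.

```python
import random


def sgd_momentum_averaged(grad_fn, x0, lr=0.5, beta=0.9, steps=5000, seed=0):
    """Run SGD with momentum and return (last_iterate, averaged_iterate).

    grad_fn(x, rng) must return a stochastic gradient at x (hypothetical interface).
    """
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)    # exponentially weighted moving average of past gradients
    avg = [0.0] * len(x)  # running mean of the iterates x_1, ..., x_t
    for t in range(1, steps + 1):
        g = grad_fn(x, rng)
        # Momentum: blend the new noisy gradient into the moving average.
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        # Step along the negative averaged gradient, scaled by the learning rate.
        x = [xi - lr * mi for xi, mi in zip(x, m)]
        # Averaging: incremental update of the mean of all iterates so far.
        avg = [ai + (xi - ai) / t for ai, xi in zip(avg, x)]
    return x, avg


def noisy_quadratic_grad(x, rng, scales=(1.0, 10.0), noise=1.0):
    """Gradient of 0.5 * sum(s_i * x_i**2) plus Gaussian noise.

    Assumption: the noise mimics mini-batch sampling error; unequal
    scales make the problem mildly ill-conditioned. The minimizer is 0.
    """
    return [s * xi + rng.gauss(0.0, noise) for s, xi in zip(scales, x)]


last, averaged = sgd_momentum_averaged(noisy_quadratic_grad, [1.0, 1.0])
```

In this toy setting the last iterate keeps fluctuating around the minimizer at the noise level set by the step size, while the averaged iterate smooths those fluctuations out, which is the intuition behind pairing momentum with averaging.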
