Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization

Naoki Sato, Hideaki Iiduka

arXiv.org Artificial Intelligence 

While stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, a theoretical explanation for this is lacking. In this paper, we show that SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. This theoretical finding reveals why momentum improves generalizability and provides new insights into the

Although convergence analyses of SGD with momentum for nonconvex functions have been provided [11, 16, 39], none of them explain why convergence is faster than with SGD. The generalizability of SGD with momentum has been well studied, and various experimental findings have been reported. While it has been suggested that momentum plays a role in reducing stochastic noise [8, 11], stochastic noise has been shown to increase generalizability [18, 37, 59], and it has been claimed that stochastic noise can help an algorithm escape from local solutions with poor generalizability [10, 15, 20, 28, 32].
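For concreteness, the optimizer under discussion can be sketched as the classical heavy-ball update, m_{t+1} = beta * m_t + g_t, x_{t+1} = x_t - lr * m_{t+1}, where g_t is a mini-batch stochastic gradient. The sketch below is illustrative only and is not the paper's analysis: the quadratic objective, the Gaussian noise model standing in for gradient variance, and all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def sgd_momentum(x0, grad_fn, lr=0.1, beta=0.9, noise_std=0.05,
                 batch_size=8, steps=200, seed=0):
    """Heavy-ball SGD: m <- beta*m + g,  x <- x - lr*m.

    The learning rate (lr), batch size, momentum factor (beta), and
    gradient-noise scale (noise_std) are exactly the quantities the
    abstract says govern the degree of implicit smoothing.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        # Mini-batch stochastic gradient: averaging batch_size noisy
        # samples shrinks the noise variance by a factor of batch_size.
        noise = rng.normal(0.0, noise_std, size=(batch_size,) + x.shape)
        g = grad_fn(x) + noise.mean(axis=0)
        m = beta * m + g
        x = x - lr * m
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x; the iterates
# settle near the minimizer at the origin despite the injected noise.
x_star = sgd_momentum(np.array([3.0, -2.0]), grad_fn=lambda x: x)
print(np.linalg.norm(x_star))
```

Varying `beta`, `lr`, `batch_size`, or `noise_std` in this sketch changes how aggressively the noisy momentum buffer averages over the loss landscape, which is the intuition behind the smoothing effect the paper formalizes.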