Revisit last-iterate convergence of mSGD under milder requirement on step size
Neural Information Processing Systems
Understanding the convergence of SGD-based optimization algorithms can help tackle large-scale machine learning problems. To ensure last-iterate convergence of SGD and momentum-based SGD (mSGD), existing studies usually constrain the step size $\epsilon_n$ to decay such that $\sum_{n=1}^{\infty}\epsilon_n^2 < \infty$, which is rather conservative and may lead to slow convergence in the early stage of the iteration. In this paper, we relax this requirement by studying an alternative step-size condition for mSGD. This relaxation allows a larger step size, such as $\epsilon_n = \frac{1}{\sqrt{n}}$, to be used to accelerate mSGD in the early stage. Under this new step size and some common conditions, we prove that the gradient norm of mSGD for non-convex loss functions asymptotically decays to zero.
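A minimal sketch of what the setting looks like in code, under stated assumptions: the function `msgd`, the stochastic-gradient callback `grad_fn`, and the momentum coefficient `beta` are hypothetical names introduced here for illustration, not the paper's exact algorithm or analysis. The schedule $\epsilon_n = 1/\sqrt{n}$ is not square-summable (since $\sum_n 1/n$ diverges), which is precisely the larger-step-size regime the abstract refers to.

```python
import numpy as np

def msgd(grad_fn, x0, beta=0.9, n_iters=1000, rng=None):
    """Illustrative momentum SGD (mSGD) loop with step size eps_n = 1/sqrt(n).

    grad_fn(x, rng) should return a stochastic gradient estimate at x;
    beta is the momentum coefficient. Hypothetical sketch, not the paper's method.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # momentum buffer
    for n in range(1, n_iters + 1):
        eps_n = 1.0 / np.sqrt(n)              # larger-than-classical step size
        g = grad_fn(x, rng)                   # stochastic gradient at current iterate
        v = beta * v + (1.0 - beta) * g       # exponential moving average of gradients
        x = x - eps_n * v                     # momentum SGD update
    return x

# Usage: noisy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x plus Gaussian noise.
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_last = msgd(noisy_grad, x0=np.ones(5))
print(np.linalg.norm(x_last))                 # gradient norm of the last iterate
```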