Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Jan-26-2025, 10:07:01 GMT–Neural Information Processing Systems

We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate $\eta$ and weight decay factor $\lambda$ mixes in function space in $\tilde{\mathcal{O}}\left(\frac{1}{\lambda\eta}\right)$ steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al. (2021) and shows that for every $T > 0$, the iterates of SGD with learning rate $\eta$ and weight decay factor $\lambda$ on the scale-invariant loss converge in distribution in $\Theta\left(\eta^{-1}\lambda^{-1}(T + \ln(\lambda/\eta))\right)$ iterations as $\eta\lambda \to 0$ while satisfying $\eta \le O(\lambda) \le O(1)$. Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as $T \to \infty$.
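To make the setting concrete, the following is a minimal sketch of one SGD step with weight decay on a scale-invariant loss, together with the nominal iteration count appearing in the bound above. The toy loss, the function names, and the use of NumPy are assumptions chosen for illustration and are not taken from the paper; the constants hidden by $\Theta(\cdot)$ are ignored.

```python
import numpy as np


def scale_invariant_loss(w, x, y):
    """Toy loss (hypothetical) that depends on w only through w / ||w||."""
    u = w / np.linalg.norm(w)
    return 0.5 * (x @ u - y) ** 2


def grad(w, x, y):
    """Gradient of the toy loss with respect to w."""
    norm = np.linalg.norm(w)
    u = w / norm
    residual = x @ u - y
    # Chain rule through u = w / ||w||: project x onto the tangent space
    # at u, then divide by ||w||. The result is orthogonal to w.
    return residual * (x - (x @ u) * u) / norm


def sgd_step(w, x, y, eta, lam):
    """One SGD step with weight decay: w <- (1 - eta*lam) * w - eta * grad."""
    return (1.0 - eta * lam) * w - eta * grad(w, x, y)


def nominal_step_count(eta, lam, T):
    """(T + ln(lam/eta)) / (eta*lam), ignoring the constants in Theta."""
    return (T + np.log(lam / eta)) / (eta * lam)
```

Two properties of scale invariance show up directly in this sketch: the gradient is orthogonal to `w`, and it shrinks like `1 / ||w||` when `w` is rescaled. Because of the orthogonality, each step satisfies `||w_new||^2 = (1 - eta*lam)^2 ||w||^2 + eta^2 ||grad||^2`, so weight decay pulls the norm in while the gradient step pushes it out.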

fast mixing, normalization and weight decay, stochastic gradient descent

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)