Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay
Neural Information Processing Systems
We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate \eta and weight decay factor \lambda mixes in function space in \tilde{\mathcal{O}}(\frac{1}{\lambda\eta}) steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al. (2021) and shows that for every T > 0, the iterates of SGD with learning rate \eta and weight decay factor \lambda on the scale-invariant loss converge in distribution in \Theta\left(\eta^{-1}\lambda^{-1}(T + \ln(\lambda/\eta))\right) iterations as \eta\lambda\to 0 while satisfying \eta \le O(\lambda)\le O(1). Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as T\to\infty.
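To make the setting concrete, the following is a minimal sketch of SGD with weight decay on a toy scale-invariant loss. It is not the paper's construction: the cosine loss, the tangent-space Gaussian noise model, and the names `loss`, `grad`, `noisy_grad`, and `sgd_step` are all assumptions chosen for illustration. The sketch shows two facts that hold for any scale-invariant loss: the loss is unchanged when the weights are rescaled, and the gradient is orthogonal to the weights, so one step gives ||w'||^2 = (1 - \eta\lambda)^2 ||w||^2 + \eta^2 ||g||^2 exactly.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def loss(w, target):
    """Toy scale-invariant loss (an assumption): 1 - cos(angle(w, target))."""
    return 1.0 - dot(w, target) / (norm(w) * norm(target))


def grad(w, target):
    """Exact gradient of `loss`; it is orthogonal to w by scale invariance."""
    n, tn = norm(w), norm(target)
    c = dot(w, target) / (n * tn)  # cosine of the angle
    return [-(t / (n * tn) - c * x / (n * n)) for x, t in zip(w, target)]


def noisy_grad(w, target, sigma, rng):
    """Gradient plus Gaussian noise projected orthogonal to w.

    The noise model is hypothetical: it is scaled by 1/||w|| so that it
    rescales with the weights the same way the true gradient does.
    """
    n = norm(w)
    z = [rng.gauss(0.0, 1.0) for _ in w]
    proj = dot(z, w) / (n * n)
    z_perp = [zi - proj * wi for zi, wi in zip(z, w)]
    g = grad(w, target)
    return [gi + sigma * zi / n for gi, zi in zip(g, z_perp)]


def sgd_step(w, g, eta, lam):
    """One SGD step with weight decay: w <- (1 - eta*lam) * w - eta * g."""
    return [(1.0 - eta * lam) * wi - eta * gi for wi, gi in zip(w, g)]


if __name__ == "__main__":
    rng = random.Random(0)
    eta, lam, sigma = 0.1, 0.01, 0.1
    w, target = [1.0, 2.0, -0.5], [0.0, 1.0, 0.0]
    # Run for a few multiples of 1/(eta*lam), the time scale in the abstract.
    for _ in range(int(3 / (eta * lam))):
        w = sgd_step(w, noisy_grad(w, target, sigma, rng), eta, lam)
    print("final loss:", loss(w, target), "final norm:", norm(w))
```

The norm identity is why weight decay and gradient noise balance each other: decay shrinks ||w|| while each orthogonal gradient step grows it. The sketch only illustrates that mechanism and makes no claim about the mixing rates proved in the paper.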
Jan-26-2025, 10:07:01 GMT