Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise
Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet, Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu
Understanding the generalization properties of optimization algorithms under heavy-tailed noise has attracted growing attention. However, the existing theoretical results mainly focus on stochastic gradient descent (SGD), and an analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable observation: for quadratic loss functions, SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
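A minimal sketch (not the authors' code) of the setting the abstract describes: SGD with momentum on a quadratic loss, where the stochastic gradient is perturbed by symmetric alpha-stable (heavy-tailed) noise. The curvature matrix, step-size, momentum parameter, tail index, and noise scale below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
d = 10                                   # problem dimension (assumed)
A = np.diag(np.linspace(0.5, 2.0, d))    # curvature of the quadratic loss f(x) = 0.5 * x^T A x
eta, beta, alpha = 0.01, 0.9, 1.7        # step-size, momentum, tail index (all illustrative)

x = rng.standard_normal(d)               # iterate
v = np.zeros(d)                          # momentum buffer

for t in range(5000):
    grad = A @ x                                               # exact gradient of the quadratic loss
    noise = levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)  # symmetric alpha-stable noise
    g = grad + 0.1 * noise                                     # heavy-tailed stochastic gradient
    v = beta * v + g                                           # momentum (SGDm) update
    x = x - eta * v                                            # parameter step

print("final loss:", 0.5 * x @ A @ x)
```

Setting alpha = 2 recovers Gaussian noise; values of alpha below 2 make the gradient noise heavy-tailed, which is the regime the paper's stability bounds address.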
Feb-2-2025