On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Sahu, Sharan, Hogan, Cameron J., Wells, Martin T.

Jan-30-2026–arXiv.org Machine Learning

In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $β$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.

artificial intelligence, machine learning, provable suboptimality, (14 more...)

arXiv.org Machine Learning

Jan-30-2026

arXiv.org PDF

Add feedback

Country:
- Europe > Russia (0.04)
- North America > United States
  - New York > Tompkins County
    - Ithaca (0.04)
  - Georgia > Fulton County
    - Atlanta (0.04)
  - California > San Diego County
    - San Diego (0.04)
- Asia
  - Middle East > UAE (0.04)
  - Russia (0.04)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Education (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.67)
    - Statistical Learning > Gradient Descent (0.54)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found