sgdm
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training
Therefore, from a system-level perspective, the design ethos of a system-efficient communication-compression algorithm is that we should guarantee that the compression/decompression of the algorithm is computationally light and takes less time, and it should also be friendly to efficient collective communication primitives.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization
Sahu, Sharan, Hogan, Cameron J., Wells, Martin T.
In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $β$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.
- Asia > Middle East > UAE (0.04)
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
An Improved Analysis of Stochastic Gradient Descent with Momentum
SGD with momentum (SGDM) has been widely applied in many machine learning tasks, and it is often applied with dynamic stepsizes and momentum weights tuned in a stagewise manner. Despite of its empirical advantage over SGD, the role of momentum is still unclear in general since previous analyses on SGDM either provide worse convergence bounds than those of SGD, or assume Lipschitz or quadratic objectives, which fail to hold in practice. Furthermore, the role of dynamic parameters has not been addressed. In this work, we show that SGDM converges as fast as SGD for smooth objectives under both strongly convex and nonconvex settings. We also prove that multistage strategy is beneficial for SGDM compared to using fixed parameters. Finally, we verify these theoretical claims by numerical experiments.
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > Jordan (0.04)
- (2 more...)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > Jordan (0.04)
- (2 more...)
A Proofs
A.1 Nonconvex stochastic optimization We give proofs of the theorems in section 3. We first give some lemmas. Following the proofs in [60], we introduce the definition of a supermartingale. Since r (0 .5, 1), it follows that the number of iterations N needed is at most O ( null To prove Theorem 5, we first prove the following lemma. Suppose that Assumptions 1 and 2 hold. When neither regularization nor damping is used, i.e.