momentum
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Andreyev, Arseniy, Ananthkumar, Advikar, Walden, Marc, Poggio, Tomaso, Beneventano, Pierfrancesco
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
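The two plateaus quoted in the abstract above can be tabulated for a given learning rate and momentum, and the upper one can be checked against the classical linear stability condition of full-batch heavy-ball momentum on a one-dimensional quadratic. The sketch below is only an illustration under stated assumptions: it assumes the heavy-ball form $x_{t+1} = x_t - η λ x_t + β (x_t - x_{t-1})$ on a quadratic with curvature $λ$, and the function names and the values of `eta` and `beta` are hypothetical. It does not reproduce the paper's Batch Sharpness measurements, and it does not derive the lower plateau, which is taken from the abstract as stated.

```python
import cmath

def plateaus(eta, beta):
    """Return the (small-batch, large-batch) plateaus quoted in the abstract."""
    return 2.0 * (1.0 - beta) / eta, 2.0 * (1.0 + beta) / eta

def heavy_ball_spectral_radius(eta, beta, lam):
    """Largest root modulus of z^2 - (1 + beta - eta*lam) z + beta = 0.

    This is the characteristic polynomial of the assumed heavy-ball
    recursion on a quadratic with curvature lam; the iteration is
    linearly stable when the returned value is below 1.
    """
    b = 1.0 + beta - eta * lam
    disc = cmath.sqrt(b * b - 4.0 * beta)
    return max(abs((b + disc) / 2.0), abs((b - disc) / 2.0))

# Hypothetical hyperparameters, chosen only for illustration.
eta, beta = 0.01, 0.9
low, high = plateaus(eta, beta)

# Just below the upper plateau the recursion contracts; just above it diverges.
rho_below = heavy_ball_spectral_radius(eta, beta, 0.99 * high)
rho_above = heavy_ball_spectral_radius(eta, beta, 1.01 * high)
print(low, high, rho_below, rho_above)
```

Under these assumptions, curvature slightly below $2(1+β)/η$ gives a spectral radius under 1 and curvature slightly above it gives a spectral radius over 1, which matches the full-batch threshold the abstract refers to.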
Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Owing to current technical limitations, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem: streaming PCA. This allows us to make progress toward understanding Async-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
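To make the setting in the abstract above concrete, here is a minimal sketch of momentum SGD for streaming PCA in which the gradient is evaluated at a stale iterate. Everything specific in it is an assumption for illustration rather than the paper's algorithm: the Oja-style update with a momentum term, the fixed delay `delay` used to model asynchrony, the per-step renormalization, the two-dimensional Gaussian data stream, and all function and parameter names.

```python
import math
import random

def normalize(v):
    """Rescale a 2-D vector to unit length."""
    n = math.hypot(v[0], v[1])
    return [v[0] / n, v[1] / n]

def oja_grad(v, x):
    """Projected streaming-PCA gradient (I - v v^T) x x^T v for 2-D vectors."""
    xv = x[0] * v[0] + x[1] * v[1]
    g = [xv * x[0], xv * x[1]]
    gv = g[0] * v[0] + g[1] * v[1]
    return [g[0] - gv * v[0], g[1] - gv * v[1]]

def async_msgd_pca(steps=5000, eta=0.005, mu=0.5, delay=2, seed=0):
    """Momentum Oja iteration whose gradient uses the iterate from `delay` steps ago."""
    rng = random.Random(seed)
    v = normalize([1.0, 1.0])
    prev = v
    # history[0] is the stale iterate, history[-1] is the current one.
    history = [v] * (delay + 1)
    for _ in range(steps):
        # Assumed data stream: covariance diag(9, 0.25), top eigenvector (1, 0).
        x = [3.0 * rng.gauss(0.0, 1.0), 0.5 * rng.gauss(0.0, 1.0)]
        g = oja_grad(history[0], x)
        new = [v[i] + mu * (v[i] - prev[i]) + eta * g[i] for i in range(2)]
        new = normalize(new)  # assumption: renormalize to stay on the unit circle
        prev, v = v, new
        history = history[1:] + [v]
    return v

v_final = async_msgd_pca()
alignment = abs(v_final[0])  # |cosine| with the top eigenvector (1, 0)
print(alignment)
```

Varying `mu` and `delay` together in this toy setup is one way to explore the tradeoff the abstract describes, although the sketch makes no claim about the paper's convergence rates.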
50a074e6a8da4662ae0a29edde722179-AuthorFeedback.pdf
In order to help clarify our contributions and organize them for readers, we provide the following table to summarize the differences between regrets. REVIEWER 4: Thank you for your comments. Concept drift occurs when the optimal model at time $t$ may no longer be the optimal model at time $t+1$. Consider an online learning problem with concept drift with $T = 3$ time periods and loss functions $f_1(x) = (x - 1)^2$, $f_2(x) = (x - 2)^2$, $f_3(x) = (x - 3)^2$. Figure 1: SGD online with momentum. Theoretical motivation via calibration: A more formal motivation of our regret can be related to the concept of calibration [1]. The comment on line 110 can be rewritten as: If the updates $\{x_1, \dots, x_T\}$ are well-calibrated, then perturbing $x_t$ by any $u$ cannot substantially reduce the cumulative loss. Hence, it can be said that the sequence $\{x_1, \dots, x_T\}$ is asymptotically calibrated with respect to $\{f_1, \dots, f_T\}$ if: We indeed ran experiments using SGD with momentum for various decay parameters and concluded that SGD with momentum is not even as stable as SGD-online (standard SGD without momentum), as shown in Figure 1.
- North America > United States (0.14)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (0.69)
- Health & Medicine > Therapeutic Area (0.47)
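The three-period concept-drift example in the excerpt above can be worked through numerically. The sketch below reads the losses as $f_t(x) = (x - t)^2$ for $t = 1, 2, 3$, which is an assumption about the intended formulas, and compares the best single fixed decision with a comparator that tracks each period's minimizer. The variable names are hypothetical, and the comparison is only meant to illustrate why a fixed comparator is a weak benchmark under drift; it is not the regret definition proposed by the authors.

```python
def loss(t, x):
    """Assumed per-period loss f_t(x) = (x - t)^2."""
    return (x - t) ** 2

T = 3
periods = range(1, T + 1)

# The best fixed decision minimizes the sum of quadratics, i.e. the mean of 1, 2, 3.
best_fixed = sum(periods) / T
static_loss = sum(loss(t, best_fixed) for t in periods)

# A comparator that moves with the drift plays x_t = t and pays nothing.
dynamic_loss = sum(loss(t, t) for t in periods)

print(best_fixed, static_loss, dynamic_loss)
```

Under this reading, the best fixed decision is $x = 2$ with cumulative loss 2, while the drifting comparator has cumulative loss 0.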