Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) are widely used in distributed machine learning, e.g., for training large collaborative filtering systems and deep neural networks. Due to current technical limitations, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. We therefore propose to analyze the algorithm through a simpler but nontrivial nonconvex problem --- streaming PCA. This allows us to make progress toward understanding Async-MSGD and to gain new insights into more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, the momentum has to be reduced (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and deep neural network training are provided to support our findings for Async-MSGD.
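To make the setting concrete, here is a minimal single-threaded sketch of MSGD for streaming PCA: an Oja-style update with heavy-ball momentum that estimates the top eigenvector of the data covariance from a stream of samples. The function name, step size `eta`, and momentum parameter `mu` are illustrative choices, not the paper's exact algorithm or constants, and the sketch omits the asynchronous (delayed-gradient) aspect entirely.

```python
import numpy as np

def msgd_streaming_pca(samples, eta=0.01, mu=0.9):
    """Heavy-ball MSGD sketch for streaming PCA (hypothetical parameters).

    For each streaming sample x_k, with v_k on the unit sphere:
        m_{k+1} = mu * m_k + x_k (x_k^T v_k)      # momentum accumulation
        v_{k+1} = normalize(v_k + eta * m_{k+1})  # project back to sphere
    """
    d = samples.shape[1]
    rng = np.random.default_rng(0)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    m = np.zeros(d)
    for x in samples:
        grad = x * (x @ v)      # stochastic gradient: sample estimate of Sigma v
        m = mu * m + grad       # heavy-ball momentum
        v = v + eta * m
        v /= np.linalg.norm(v)  # keep the iterate on the unit sphere
    return v
```

With data whose covariance has a dominant first eigendirection, the returned `v` aligns with that direction; shrinking `mu` toward 0 recovers plain (momentum-free) SGD, which is the knob the asynchrony--momentum tradeoff is about.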
50a074e6a8da4662ae0a29edde722179-AuthorFeedback.pdf
In order to help clarify our contributions and organize them for readers, we provide the following table to summarize the differences between regrets.

REVIEWER 4: Thank you for your comments. Concept drift occurs when the optimal model at time t may no longer be the optimal model at time t+1. Consider an online learning problem with concept drift with T = 3 time periods and loss functions f_1(x) = (x - 1)^2, f_2(x) = (x - 2)^2, f_3(x) = (x - 3)^2.

Figure 1: SGD online with momentum.

Theoretical motivation via calibration: A more formal motivation of our regret can be related to the concept of calibration [1]. The comment on line 110 can be rewritten as: if the updates {x_1, ..., x_T} are well-calibrated, then perturbing x_t by any u cannot substantially reduce the cumulative loss. Hence, it can be said that the sequence {x_1, ..., x_T} is asymptotically calibrated with respect to {f_1, ..., f_T} if:

We indeed ran experiments using SGD with momentum for various decay parameters and concluded that SGD with momentum is not even as stable as SGD-online (standard SGD without momentum), as shown in Figure 1.
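The drifting-quadratic example above can be run directly. The following sketch is hypothetical (the feedback does not give step sizes or a momentum decay value): with f_t(x) = (x - t)^2 the per-period optimum drifts from 1 to 3, and with step size eta = 0.5 plain online SGD tracks the drift exactly, while a heavy-ball variant with decay 0.9 overshoots the final optimum, illustrating the claimed instability of SGD with momentum under concept drift.

```python
def sgd_online(grads, x0=0.0, eta=0.5):
    """Plain online SGD: one gradient step per period."""
    xs = [x0]
    for grad in grads:
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs

def sgd_momentum(grads, x0=0.0, eta=0.5, beta=0.9):
    """Online SGD with heavy-ball momentum (decay beta, illustrative values)."""
    xs, m = [x0], 0.0
    for grad in grads:
        m = beta * m + grad(xs[-1])
        xs.append(xs[-1] - eta * m)
    return xs

# Gradients of f_t(x) = (x - t)^2 for t = 1, 2, 3 (the concept-drift example).
drift_grads = [lambda x, t=t: 2.0 * (x - t) for t in (1, 2, 3)]
```

Running both from x0 = 0 gives iterates 0, 1, 2, 3 for plain SGD (it lands on each period's optimum), whereas the momentum iterates carry stale velocity across the drift and end beyond the final optimum at x = 3.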