

Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization

Neural Information Processing Systems

Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Due to current technical limitations, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem --- streaming PCA. This allows us to make progress toward understanding Async-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
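A minimal sketch of the setting the abstract describes, assuming a standard Oja-style stochastic update for the leading eigenvector; the step size, momentum coefficient, and staleness value are illustrative choices, not the paper's:

```python
import numpy as np

# Hypothetical sketch (not the paper's exact algorithm): momentum SGD for
# streaming PCA, where the worker computes its gradient from a stale iterate
# to simulate asynchrony.
rng = np.random.default_rng(0)

d, steps = 10, 20000
eta, mu, delay = 0.01, 0.3, 4        # step size, momentum, staleness (assumed)

# Diagonal covariance with a dominant leading eigenvector e_1.
evals = np.array([2.0] + [1.0] * (d - 1))

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
m = np.zeros(d)
history = [v.copy()] * (delay + 1)   # stale iterates seen by the worker

for _ in range(steps):
    x = rng.standard_normal(d) * np.sqrt(evals)   # streaming sample, cov = diag(evals)
    v_stale = history[0]                          # delayed read (asynchrony)
    g = (x @ v_stale) * x                         # stochastic Oja gradient
    m = mu * m + g                                # momentum accumulation
    v = v + eta * m
    v /= np.linalg.norm(v)                        # project back to the sphere
    history = history[1:] + [v.copy()]

# Alignment with the true leading eigenvector e_1.
alignment = abs(v[0])
print(alignment)
```

With a small momentum value the iterate stays aligned with the leading eigenvector despite the delay; pushing `mu` toward 1 while keeping the delay is where the tradeoff described above shows up.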


50a074e6a8da4662ae0a29edde722179-AuthorFeedback.pdf

Neural Information Processing Systems

In order to help clarify our contributions and organize them for readers, we provide the following table to summarize the differences between regrets.

REVIEWER 4: Thank you for your comments. Concept drift occurs when the optimal model at time t may no longer be the optimal model at time t+1. Consider an online learning problem with concept drift with T = 3 time periods and loss functions: f1(x) = (x-1)^2, f2(x) = (x-2)^2, f3(x) = (x-3)^2.

Theoretical motivation via calibration: A more formal motivation of our regret can be related to the concept of calibration [1]. The comment on line 110 can be rewritten as: If the updates {x1, ..., xT} are well-calibrated, then perturbing xt by any u cannot substantially reduce the cumulative loss. Hence, it can be said that the sequence {x1, ..., xT} is asymptotically calibrated with respect to {f1, ..., fT} if:

Figure 1: SGD online with momentum.

We indeed ran experiments using SGD with momentum for various decay parameters and concluded that SGD with momentum is not even as stable as SGD-online (standard SGD without momentum), as shown in Figure 1.
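The drifting quadratics in the reviewers' example can be run directly; this toy script (the step size and momentum coefficient are assumed values for illustration, not from the feedback) computes the cumulative loss of plain online SGD and of SGD with momentum on f1, f2, f3:

```python
# Concept-drift toy: the minimizer of f_t(x) = (x - t)^2 moves each round.
targets = [1.0, 2.0, 3.0]            # minimizers of f_1, f_2, f_3

def run(eta=0.4, mu=0.0):
    """Online (momentum) SGD: suffer f_t(x_t), then update from its gradient."""
    x, m, total = 0.0, 0.0, 0.0
    for t in targets:
        total += (x - t) ** 2        # loss suffered at this round
        g = 2.0 * (x - t)            # gradient of f_t at the current iterate
        m = mu * m + g               # momentum buffer (mu=0 gives plain SGD)
        x = x - eta * m
    return total

loss_plain = run(mu=0.0)             # SGD-online
loss_mom = run(mu=0.9)               # SGD with momentum
print(loss_plain, loss_mom)
```

Which variant accumulates less loss depends on how the drift aligns with the momentum direction; the feedback's Figure 1 reports the stability comparison over various decay parameters.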



1e5cff01121223de917a84a242de30a5-Paper-Conference.pdf

Neural Information Processing Systems

In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems.
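A hedged sketch of the idea just described: gradients arriving out of order from asynchronous workers are buffered and folded into the momentum buffer in the order of their iteration indexes. The function name and the reordering buffer are illustrative, not OrMo's exact bookkeeping:

```python
import numpy as np

def ordered_momentum_apply(arrivals, eta=0.1, mu=0.9, dim=2):
    """arrivals: list of (iteration_index, gradient) in arrival order.

    Buffers out-of-order gradients and applies each one, with momentum,
    only when all lower-indexed gradients have been applied.
    """
    x = np.zeros(dim)
    m = np.zeros(dim)
    buffer = {}                      # index -> gradient, waiting for its turn
    next_idx = 0
    for idx, g in arrivals:
        buffer[idx] = np.asarray(g, dtype=float)
        # Apply every gradient whose turn has come, in index order.
        while next_idx in buffer:
            m = mu * m + buffer.pop(next_idx)
            x = x - eta * m
            next_idx += 1
    return x, next_idx

# Out-of-order arrivals: the gradient for index 1 arrives before index 0.
arrivals = [(1, [1.0, 0.0]), (0, [0.0, 1.0]), (2, [1.0, 1.0])]
x, applied = ordered_momentum_apply(arrivals)
print(x, applied)
```

Because the momentum recursion is applied in index order, the result matches what a serial run with the same gradients would produce, regardless of arrival order.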