SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Michael Rabbat

arXiv.org Machine Learning 

ABSTRACT

Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF is a particular instance of the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.

1 INTRODUCTION

Distributed optimization (Chen et al., 2016; Goyal et al., 2017) is essential for training large models on large datasets (Radford et al., 2019; Liu et al., 2019; Mahajan et al., 2018b). Currently, the most widely-used approaches have workers compute small mini-batch gradients locally, in parallel, and then aggregate these using a blocking communication primitive, AllReduce, before taking an optimizer step. Communication overhead is a major issue limiting the scaling of this approach, since AllReduce must complete before every step and blocking communications are sensitive to stragglers (Dutta et al., 2018; Ferdinand et al., 2019).

Multiple complementary approaches have recently been investigated to reduce or hide communication overhead. Decentralized training (Jiang et al., 2017; Lian et al., 2017; 2018; Assran et al., 2019) reduces idling due to blocking and stragglers by employing approximate gradient aggregation (e.g., via gossip or distributed averaging). Approaches such as Local SGD reduce the frequency of communication by having workers perform multiple updates between each round of communication (McDonald et al., 2010; McMahan et al., 2017; Zhou & Cong, 2018; Stich, 2019; Yu et al., 2019b). It is also possible to combine decentralized algorithms with Local SGD (Wang & Joshi,

* Work performed while doing an internship at Facebook AI Research.
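To make the "periodic synchronization plus momentum update" structure concrete, below is a minimal single-process sketch of a SlowMo-style outer loop on a toy quadratic problem. The specific update rule, the hyper-parameter names (base_lr, slow_lr, slow_momentum, tau), and the plain-SGD base optimizer are illustrative assumptions for this sketch, not a verbatim transcription of the paper's algorithm.

# A minimal single-process sketch of a SlowMo-style outer loop, under the
# assumptions stated above. Workers are simulated sequentially for clarity.
import torch

torch.manual_seed(0)

num_workers, tau = 4, 5                      # workers and local steps per outer iteration
base_lr, slow_lr, slow_momentum = 0.1, 1.0, 0.5

# Toy objective f(x) = ||x - target||^2, with per-step noise standing in for mini-batch noise.
target = torch.ones(10)

x = torch.zeros(10)                          # globally synchronized parameters x_t
u = torch.zeros(10)                          # slow momentum buffer u_t

for outer in range(20):
    # Each worker runs tau local base-optimizer (plain SGD) steps from the shared point x_t.
    local = []
    for w in range(num_workers):
        xw = x.clone()
        for _ in range(tau):
            noise = 0.1 * torch.randn(10)
            grad = 2 * (xw - target) + noise
            xw = xw - base_lr * grad
        local.append(xw)

    # Periodic synchronization: exact-average the workers' parameters.
    x_avg = torch.stack(local).mean(dim=0)

    # Slow momentum update, treating (x_t - x_avg) / base_lr as a pseudo-gradient,
    # followed by an outer step with the slow learning rate.
    u = slow_momentum * u + (x - x_avg) / base_lr
    x = x - slow_lr * base_lr * u

print(f"distance to optimum after 20 outer steps: {(x - target).norm().item():.4f}")

With slow_momentum = 0 and slow_lr = 1, the outer update reduces to plain periodic parameter averaging (as in Local SGD), which is the sense in which SlowMo acts as a wrapper around a communication-efficient base optimizer.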
