CSER: Communication-efficient SGD with Error Reset

Neural Information Processing Systems

In recent years, the sizes of both machine-learning models and datasets have been increasing rapidly. To accelerate the training, it is common to distribute the computation on multiple machines.





CSER: Communication-efficient SGD with Error Reset

Xie, Cong, Zheng, Shuai, Koyejo, Oluwasanmi, Gupta, Indranil, Li, Mu, Lin, Haibin

arXiv.org Machine Learning

The scalability of distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The key idea in CSER is twofold: first, a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic resets of the resulting local residual errors; second, partial synchronization of both the gradients and the models, leveraging the advantages of each. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that, when combined with highly aggressive compressors, the CSER algorithms: i) cause no loss of accuracy, and ii) accelerate training by nearly $10\times$ for CIFAR-100 and by $4.5\times$ for ImageNet.
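The abstract's core mechanism, error-compensated compression with a locally kept residual that is periodically reset, can be sketched for a single worker as follows. This is a minimal illustration, not the paper's exact algorithm: `topk_compress` stands in for an arbitrary compressor, and the learning rate and function names are assumptions for the example.

```python
import numpy as np

def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x, zeroing the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def error_reset_step(grad, error, k, lr=0.1):
    """One worker-side step of error-compensated compression:
    fold the accumulated residual into the fresh gradient,
    transmit only the compressed part, and keep what the
    compressor dropped as the new local residual (simplified
    single-worker view of the 'error reset' idea)."""
    corrected = grad + error              # add residual from earlier rounds
    sent = topk_compress(corrected, k)    # the part that goes over the wire
    new_error = corrected - sent          # residual retained locally
    return lr * sent, new_error
```

In a full CSER run the residual would additionally be reset on the paper's periodic synchronization schedule; this sketch only shows the per-step compensation.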


Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations

Basu, Debraj, Data, Deepesh, Karakus, Can, Diggavi, Suhas

arXiv.org Machine Learning

The communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the \emph{Qsparse-local-SGD} algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation achieved by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of \emph{Qsparse-local-SGD}. We analyze the convergence of \emph{Qsparse-local-SGD} in the \emph{distributed} setting for smooth non-convex and convex objective functions. We demonstrate that \emph{Qsparse-local-SGD} converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use \emph{Qsparse-local-SGD} to train ResNet-50 on ImageNet, and show that it yields significant savings over the state of the art in the number of bits transmitted to reach a target accuracy.
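The composition the abstract describes, sparsification followed by quantization with an error-compensation memory, can be illustrated with a small sketch. This is an assumed, simplified scheme for exposition (top-k plus a scaled-sign quantizer), not the paper's exact compressor family; all function names here are hypothetical.

```python
import numpy as np

def sign_quantize(x):
    """Scaled sign quantizer: transmit sign(x) scaled by the
    mean magnitude of the nonzero entries."""
    nz = np.abs(x[x != 0])
    scale = nz.mean() if nz.size else 0.0
    return scale * np.sign(x)

def qsparse_compress(grad, memory, k):
    """Compose top-k sparsification with quantization, keeping the
    compression error in a local memory (error compensation), in the
    spirit of Qsparse-local-SGD; illustrative only."""
    corrected = grad + memory                 # error-compensated gradient
    idx = np.argsort(np.abs(corrected))[-k:]  # indices of k largest entries
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]              # sparsify
    quantized = sign_quantize(sparse)         # then quantize the survivors
    new_memory = corrected - quantized        # residual carried to next round
    return quantized, new_memory
```

Only the k indices and one scale need to be communicated per round; the residual `new_memory` is what makes the scheme converge at the vanilla-SGD rate despite the aggressive compression.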