The Convergence of Sparsified Gradient Methods

Alistarh, Dan, Hoefler, Torsten, Johansson, Mikael, Konstantinov, Nikola, Khirirat, Sarit, Renggli, Cedric

Feb-14-2020, 17:42:11 GMT–Neural Information Processing Systems

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods (where each node sorts gradients by magnitude and communicates only a subset of the components, accumulating the rest locally) are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification.
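
To make the sparsification step concrete, the following is a minimal single-node sketch of magnitude-based selection with local accumulation. It is an illustration under stated assumptions, not the authors' implementation: the function name `topk_sparsify`, the parameter `k`, and the list-based `residual` buffer are hypothetical, and a real system would operate on tensors and aggregate the sparse messages across nodes.

```python
from typing import List, Tuple


def topk_sparsify(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[Tuple[int, float]], List[float]]:
    """Pick the k largest-magnitude entries of (grad + residual).

    Hypothetical helper for illustration. Returns the sparse message as
    (index, value) pairs and the new residual holding everything unsent.
    """
    # Add the remainder accumulated locally in earlier steps.
    acc = [g + r for g, r in zip(grad, residual)]
    # Sort component indices by magnitude and keep the k largest.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    message = [(i, acc[i]) for i in sorted(top)]
    # Communicated components are cleared; the rest stay in the residual.
    new_residual = list(acc)
    for i in top:
        new_residual[i] = 0.0
    return message, new_residual


# Two consecutive steps on one node with k = 2 out of 4 components.
residual = [0.0, 0.0, 0.0, 0.0]
msg1, residual = topk_sparsify([0.5, -2.0, 0.25, 1.0], residual, k=2)
msg2, residual = topk_sparsify([0.5, 0.0, 0.25, 0.0], residual, k=2)
```

In this sketch the small components that were withheld in the first step (indices 0 and 2) grow in the residual until they are large enough to be selected in the second step, which is the local accumulation the abstract describes.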

convergence, magnitude, sparsified gradient method

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning > Gradient Descent (0.63)
  - Neural Networks (0.63)