TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
Neural Information Processing Systems
High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad, which uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1, 0, 1}, which can aggressively reduce the communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence.
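A minimal sketch of the stochastic ternarization described above, assuming a per-tensor scale equal to the maximum absolute gradient value (as suggested by layer-wise ternarizing); the helper name and signature are illustrative, not the paper's reference implementation:

```python
import numpy as np

def ternarize(grad, rng=np.random.default_rng()):
    """Stochastically map a gradient tensor to {-s, 0, +s}.

    s is the per-tensor scale (max absolute value). Each element keeps
    its sign with probability |g_k| / s and is zeroed otherwise, so the
    expected value of the ternary gradient equals the original gradient.
    """
    s = np.max(np.abs(grad))
    if s == 0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < (np.abs(grad) / s)
    return s * np.sign(grad) * keep

# Example: only the scale s and the signs need to be communicated.
g = np.array([0.4, -0.1, 0.0, 0.9])
print(ternarize(g))
```

Keeping the ternarization unbiased in expectation is what allows a convergence argument under a gradient bound, and gradient clipping limits the scale s so that small gradients are not zeroed out too aggressively.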