Rethinking gradient sparsification as total error minimization
Neural Information Processing Systems
Gradient compression is a widely established remedy for the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-k sparsification, sometimes with k as small as 0.1% of the gradient size, enables training to the same model quality as the uncompressed case in a similar number of iterations. From the optimization perspective, we find that Top-k is the communication-optimal sparsifier given a per-iteration budget of k elements. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary, one that moves from per-iteration optimality to optimality over the entire training. We identify that the total error, i.e., the sum of the compression errors over all iterations, encapsulates sparsification throughout training. We then propose a communication complexity model that minimizes the total error under a communication budget for the entire training. We find that the hard-threshold sparsifier, a variant of the Top-k sparsifier with k determined by a constant hard threshold, is the optimal sparsifier for this model.
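To make the two sparsifiers and the error-feedback mechanism concrete, here is a minimal NumPy sketch; the function names and the implementation details are illustrative assumptions, not the authors' code.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of the gradient; zero the rest."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest magnitudes
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def hard_threshold_sparsify(grad, threshold):
    """Keep only entries whose magnitude exceeds a constant hard threshold."""
    return np.where(np.abs(grad) > threshold, grad, 0.0)

def error_feedback_step(grad, residual, compress, **kwargs):
    """One error-feedback step: the compression error (what was not sent)
    is accumulated and added back to the next iteration's gradient."""
    corrected = grad + residual          # add the accumulated error
    sent = compress(corrected, **kwargs) # sparsified message actually communicated
    new_residual = corrected - sent      # this iteration's compression error
    return sent, new_residual
```

Under this sketch, the per-iteration compression error is `corrected - sent`, and the total error discussed in the abstract corresponds to summing these errors over all training iterations; Top-k fixes the number of transmitted elements per step, while the hard-threshold variant lets that number vary while keeping the selection rule constant.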