Rethinking gradient sparsification as total error minimization

Apr-25-2026, 15:45:54 GMT–Neural Information Processing Systems

Gradient compression is a widely established remedy for the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-k sparsification, sometimes with k as small as 0.1% of the gradient size, enables training to the same model quality as the uncompressed case in a similar number of iterations. From the optimization perspective, we find that Top-k is the communication-optimal sparsifier given a per-iteration budget of k elements. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary: one that moves from per-iteration optimality to optimality over the entire training. We identify that the total error, the sum of the compression errors over all iterations, encapsulates sparsification throughout training.
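To make the terms in the abstract concrete, here is a minimal sketch of Top-k sparsification with error feedback that also accumulates a total error across iterations. It is an illustration under stated assumptions, not the authors' implementation: the function names `top_k` and `run_error_feedback` are hypothetical, gradients are plain Python lists, and the per-iteration compression error is measured as the squared Euclidean norm of what was left unsent, which is one possible reading of "the sum of the compression errors".

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def run_error_feedback(grads, k):
    """Apply Top-k with error feedback to a sequence of gradient vectors.

    Returns (sent, memory, total_error), where
      sent        -- the sparse vectors that would be communicated,
      memory      -- the residual still held locally after the last step,
      total_error -- sum over iterations of the squared norm of the residual
                     (assumed error measure; the paper's may differ).
    """
    memory = [0.0] * len(grads[0])
    sent = []
    total_error = 0.0
    for g in grads:
        # Error feedback: add back what earlier iterations failed to send.
        corrected = [gi + mi for gi, mi in zip(g, memory)]
        sparse = top_k(corrected, k)
        # Whatever Top-k dropped becomes the new residual.
        memory = [c - s for c, s in zip(corrected, sparse)]
        total_error += sum(m * m for m in memory)
        sent.append(sparse)
    return sent, memory, total_error


# Two toy "gradients" with a budget of k = 1 element per iteration.
sent, memory, total_error = run_error_feedback(
    [[1.0, -3.0, 2.0], [0.5, 0.0, 0.0]], k=1
)
```

In this toy run, the entry dropped in the first iteration (2.0) is carried in the residual and sent in the second iteration. The per-iteration view asks only which k entries to send now, while the total-error view sums the residuals over the whole run.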

artificial intelligence, compressor, machine learning


Country:
- North America > Canada (0.28)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks > Deep Learning (0.89)
