AITopics | compression operator

Tight analyses of first-order methods with error feedback

Neural Information Processing Systems, Jun-23-2026, 01:20:39 GMT

Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes--most notably EF and EF21--were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method--with matching lower bounds.

machine learning, natural language, programming language, (18 more...)

Neural Information Processing Systems

Country: Europe > France (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Software > Programming Languages (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Muon Does Not Converge on Convex Lipschitz Functions

Parshakova, Tetiana, Khaled, Ahmed, Crawshaw, Michael, Garrigos, Guillaume, Gower, Robert M.

arXiv.org Machine Learning, May-12-2026

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.

artificial intelligence, machine learning, muon, (16 more...)

arXiv.org Machine Learning

2605.0898

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations

Debraj Basu, Deepesh Data, Can Karakus, Suhas Diggavi

Neural Information Processing Systems, Feb-14-2026, 07:31:30 GMT

Neural Information Processing Systems http://nips.cc/

local computation, qsparse-local-sgd, sparsification, (12 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > Canada (0.04)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

cd86c6a804d925c4cbc5a7b96843f6d5-Paper-Conference.pdf

Neural Information Processing Systems, Feb-11-2026, 23:26:54 GMT

communication compression, compression, optimization, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)

7274a60c83145b1082be9caa91926ecf-Supplemental.pdf

Neural Information Processing Systems, Feb-9-2026, 08:45:58 GMT

algorithm, canit, compression, (14 more...)

Neural Information Processing Systems

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

7274a60c83145b1082be9caa91926ecf-Paper.pdf

Neural Information Processing Systems, Feb-9-2026, 08:45:55 GMT

arxiv preprint arxiv, canita, compression, (12 more...)

Neural Information Processing Systems

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques

Neural Information Processing Systems, Dec-24-2025, 02:23:44 GMT

To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them. Recently, Safaryan et al. (2021) pioneered a dramatically different compression design approach: they first use the local training data to form local smoothness matrices and then propose to design a compressor capable of exploiting the smoothness information contained therein. While this novel approach leads to substantial savings in communication, it is limited to sparsification as it crucially depends on the linearity of the compression operator. In this work, we generalize their smoothness-aware compression strategy to arbitrary unbiased compression operators, which also include sparsification. Specializing our results to stochastic quantization, we guarantee significant savings in communication complexity compared to standard quantization. In particular, we prove that block quantization with $n$ blocks theoretically outperforms single block quantization, leading to a reduction in communication complexity by an $\mathcal{O}(n)$ factor, where $n$ is the number of nodes in the distributed system. Finally, we provide extensive numerical evidence with convex optimization problems that our smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to three evaluation metrics: the number of iterations, the total amount of bits communicated, and wall-clock time.
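As a rough illustration of the unbiased stochastic quantization and block quantization mentioned in this abstract, the sketch below implements a generic random-dithering quantizer and a blockwise variant. It is not the smoothness-aware scheme from the paper: the function names `quantize` and `block_quantize`, the `levels` parameter, and the choice of the Euclidean norm as the scale are all assumptions made for the example.

```python
import math
import random


def quantize(x, levels, rng):
    """Unbiased stochastic quantization of x onto `levels` magnitude levels.

    Each entry is scaled by the Euclidean norm of x and then randomly rounded
    to a neighbouring level so that the expectation equals the input.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    out = []
    for v in x:
        scaled = levels * abs(v) / norm          # lies in [0, levels]
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part (unbiased).
        level = low + (1 if rng.random() < scaled - low else 0)
        out.append(math.copysign(norm * level / levels, v))
    return out


def block_quantize(x, block_size, levels, rng):
    """Quantize consecutive blocks of x independently, each with its own norm."""
    out = []
    for start in range(0, len(x), block_size):
        out.extend(quantize(x[start:start + block_size], levels, rng))
    return out


rng = random.Random(0)
x = [3.0, 0.0, 0.0, 4.0]
print(quantize(x, levels=1, rng=rng))                      # single block
print(block_quantize(x, block_size=2, levels=1, rng=rng))  # two blocks
```

With a single level, the single-block call can only return entries of magnitude 0 or 5 (the norm of the whole vector), whereas the two-block call scales each block by its own norm. This is the intuition behind using more blocks, at the price of sending one norm per block.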

optimization, quantization, smoothness-aware quantization technique, (8 more...)

Neural Information Processing Systems

Genre: Research Report (0.59)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.76)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.59)

Tight analyses of first-order methods with error feedback

Thomsen, Daniel Berg, Taylor, Adrien, Dieuleveut, Aymeric

arXiv.org Artificial Intelligence, Nov-4-2025

Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
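To make the three methods compared in this abstract concrete, here is a minimal single-agent sketch of compressed gradient descent, EF, and EF21 using their commonly stated update rules. The top-k compressor, the diagonal quadratic objective, the step size, and all function names are assumptions chosen for illustration. They are not taken from the paper, and the sketch says nothing about the tight rates it derives.

```python
def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries of v."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out


# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2).
A = [1.0, 0.5, 0.25]


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))


def grad(x):
    return [a * xi for a, xi in zip(A, x)]


def compressed_gd(x, gamma, steps, k=1):
    """x <- x - gamma * C(grad f(x)), with no memory of discarded entries."""
    x = list(x)
    for _ in range(steps):
        c = top_k(grad(x), k)
        x = [xi - gamma * ci for xi, ci in zip(x, c)]
    return x


def ef(x, gamma, steps, k=1):
    """EF: add the accumulated compression error to the step before compressing."""
    x = list(x)
    e = [0.0] * len(x)
    for _ in range(steps):
        p = [gamma * gi + ei for gi, ei in zip(grad(x), e)]
        d = top_k(p, k)                            # what is actually applied
        x = [xi - di for xi, di in zip(x, d)]
        e = [pi - di for pi, di in zip(p, d)]      # what was left out
    return x


def ef21(x, gamma, steps, k=1):
    """EF21: maintain a gradient estimate g, updated by compressed differences."""
    x = list(x)
    g = top_k(grad(x), k)
    for _ in range(steps):
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
        diff = [ni - gi for ni, gi in zip(grad(x), g)]
        g = [gi + ci for gi, ci in zip(g, top_k(diff, k))]
    return x


x0 = [1.0, 1.0, 1.0]
for name, method in [("CGD", compressed_gd), ("EF", ef), ("EF21", ef21)]:
    print(name, f(method(x0, gamma=0.02, steps=3000)))
```

The step size here is deliberately small so that all three variants make progress on this toy problem. Comparing their best achievable rates is the question the paper addresses and is outside the scope of this sketch.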

artificial intelligence, lyapunov function, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2506.05271

Country: Europe > France (0.28)

Genre: