BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression

Neural Information Processing Systems 

Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle the communication bottleneck, there have been many efforts to design communication-compressed algorithms for decentralized nonconvex optimization, where the clients are only allowed to communicate a small amount of quantized information (aka bits) with their neighbors over a predefined graph topology. Despite significant efforts, the state-of-the-art algorithm in the nonconvex setting still suffers from a slower rate of convergence O((G/T) {2/3}) compared with their uncompressed counterpart, where G measures the data heterogeneity across different clients, and T is the number of communication rounds. This paper proposes BEER, which adopts communication compression with gradient tracking, and shows it converges at a faster rate of O(1/T) . This significantly improves over the state-of-the-art rate, by matching the rate without compression even under arbitrary data heterogeneity.