Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

Chen, Zhijie, Li, Qiaobo, Banerjee, Arindam

arXiv.org Artificial Intelligence 

Despite the recent success of federated learning (FL), the cost of communication arguably remains the main challenge. Wang et al. (2023) showed that a 20 Gbps network bandwidth is needed to bring the communication overhead to an acceptable scale for finetuning GPT-J-6B, which is unrealistic in distributed settings. Even under good network conditions, reducing the communication complexity allows one to train much larger models within the same communication budget. The communication cost of FL can be written as O(dT), where d is the ambient dimension of the parameter space and T is the number of rounds to convergence. Various methods have been proposed to reduce T, e.g., local training (Stich, 2018) and large-batch training (Xu et al., 2023). Folklore in centralized training suggests that T depends heavily on the choice of optimizer, with adaptive methods usually exhibiting faster convergence and better generalization, especially for transformer-based models (Reddi et al., 2019). In decentralized settings, adaptive methods are also favorable due to their robustness to data heterogeneity; for instance, adaptive methods are guaranteed to converge under heavy-tailed noise while SGD is not (Zhang et al., 2020). These merits should, in principle, be preserved in communication-efficient FL algorithms. The alternative approach to reducing communication cost is to be more thrifty with the bits communicated in a single round, i.e., to reduce the O(d) factor, which dominates the communication complexity for modern neural networks where d ≫ T. Considerable effort has been devoted to designing efficient gradient compression methods.
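To make the "reduce the O(d) factor" idea concrete, the following is a minimal, illustrative sketch of count-sketch-style gradient compression, not the specific algorithm analyzed in the paper: a client hashes its d-dimensional gradient into b ≪ d buckets with random signs and communicates only the b bucket sums, from which an unbiased coordinate-wise estimate can be recovered. The dimensions, variable names, and single-hash design below are assumptions chosen for brevity.

```python
import numpy as np

# Illustrative count-sketch style compressor (hypothetical parameters), showing how
# per-round communication can drop from O(d) to O(b) with a sketch size b << d.
rng = np.random.default_rng(0)

d, b = 1_000_000, 4_096                  # ambient dimension d, sketch size b (assumed values)
bucket = rng.integers(0, b, size=d)      # random hash of each coordinate into a bucket
sign = rng.choice([-1.0, 1.0], size=d)   # random Rademacher signs

def sketch(g):
    """Compress a d-dimensional gradient g into b numbers."""
    s = np.zeros(b)
    np.add.at(s, bucket, sign * g)       # each coordinate adds its signed value to its bucket
    return s

def unsketch(s):
    """Unbiased estimate of g from the sketch (one hash here; practical count sketches use several)."""
    return sign * s[bucket]

g = rng.standard_normal(d)               # a client "gradient"
s = sketch(g)                            # only b floats are communicated this round instead of d
g_hat = unsketch(s)
print(s.size / g.size)                   # compression ratio b/d, here about 0.004
```

The single-hash estimate is unbiased but noisy; in practice several independent hash/sign pairs are combined (e.g., by a median), trading a small constant factor in communication for lower estimation variance.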