FedImpro: Measuring and Improving Client Update in Federated Learning

Tang, Zhenheng, Zhang, Yonggang, Shi, Shaohuai, Tian, Xinmei, Liu, Tongliang, Han, Bo, Chu, Xiaowen

arXiv.org Artificial Intelligence 

Federated Learning (FL) models often experience client drift caused by heterogeneous data, i.e., data distributions that differ across clients. To address this issue, prior work primarily focuses on manipulating the existing gradients to obtain more consistent client models. In this paper, we present an alternative perspective on client drift and aim to mitigate it by generating improved local models. First, we analyze the generalization contribution of local training and conclude that it is bounded by the conditional Wasserstein distance between the data distributions of different clients. We then propose FedImpro, which constructs similar conditional distributions for local training. Specifically, FedImpro decouples the model into high-level and low-level components and trains the high-level portion on reconstructed feature distributions. This approach enhances the generalization contribution and reduces gradient dissimilarity in FL. Experimental results show that FedImpro helps FL defend against data heterogeneity and enhances the generalization performance of the model.

The convergence rate and generalization performance of FL suffer from heterogeneous data distributions across clients (Non-IID data) (Kairouz et al., 2019). The FL community has found, both theoretically and empirically, that the "client drift" caused by heterogeneous data is the main cause of this performance drop (Guo et al.; Wang et al., 2020a). Client drift refers to the large distance between client models after local training on private datasets. Recent convergence analyses of FedAvg (Reddi et al., 2021; Woodworth et al., 2020) show that the degree of client drift is linearly upper bounded by gradient dissimilarity. Therefore, most existing works (Karimireddy et al., 2020; Wang et al., 2020a) focus on gradient correction techniques to accelerate the convergence of local training.
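The gradient dissimilarity that upper-bounds client drift can be illustrated with a minimal numpy sketch. The specific measure below (the largest per-client gradient norm relative to the averaged gradient's norm, a common bounded-dissimilarity form) and all variable names are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_dissimilarity(client_grads):
    """max_i ||g_i||^2 / ||g_avg||^2 over clients' flattened gradients.
    Equals 1 when all client gradients coincide; grows with heterogeneity."""
    g_avg = np.mean(client_grads, axis=0)
    norms = np.sum(client_grads ** 2, axis=1)
    return float(np.max(norms) / np.sum(g_avg ** 2))

# IID-like clients: identical gradients -> dissimilarity ~1.
g = rng.normal(size=10)
iid = np.stack([g, g, g])

# Heterogeneous clients: conflicting update directions -> larger value.
het = np.stack([g, -0.5 * g, rng.normal(size=10)])

print(gradient_dissimilarity(iid))  # ~1.0 (identical gradients)
print(gradient_dissimilarity(het))
```

Under this measure, gradient-correction methods can be read as trying to push the heterogeneous case back toward the IID value of 1.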
These techniques, however, rely on manipulating gradients and updates to obtain more similar gradients (Woodworth et al., 2020; Wang et al., 2020a; Sun et al., 2023a), and their empirical results show that a performance gap still remains between FL and centralized training. In this paper, we provide a novel view on correcting gradients and updates. Specifically, we formulate the objective of local training in FL systems as a generalization contribution problem: how much local training on one client improves the server model's generalization performance on other clients' distributions. Accordingly, we evaluate the generalization performance of a local model on other clients' data distributions.
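Evaluating a locally trained model on the other clients' data, as described above, can be sketched with a toy setup. The data generator, the least-squares stand-in for local training, and all names below are placeholder assumptions, not the paper's experimental protocol:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)

def make_client(shift, n=200):
    """Toy Non-IID client: inputs are shifted per client (covariate shift),
    labels follow one shared noisy linear rule."""
    X = rng.normal(size=(n, 5)) + shift
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

clients = [make_client(s) for s in (-2.0, 0.0, 2.0)]

def local_train(X, y):
    # Least-squares fit stands in for local SGD on one client's data.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

for i, (Xi, yi) in enumerate(clients):
    w_i = local_train(Xi, yi)
    # Generalization-contribution proxy: error on the OTHER clients' data.
    others = [mse(w_i, Xj, yj) for j, (Xj, yj) in enumerate(clients) if j != i]
    print(f"client {i}: local mse={mse(w_i, Xi, yi):.3f}, "
          f"cross-client mse={np.mean(others):.3f}")
```

The cross-client error is the quantity of interest: a local update is valuable to the federation only insofar as it also reduces error on distributions the client never saw.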