
Federated Submodel Optimization for Hot and Cold Data Features

Neural Information Processing Systems

We focus on federated learning in practical recommender system and natural language processing scenarios. The global model for federated optimization typically contains a large and sparse embedding layer, while each client's local data tend to interact with only a subset of the features, so each client updates only a small submodel containing the embedding vectors of those features. We identify a new and important issue: distinct data features normally involve different numbers of clients, giving rise to a differentiation between hot and cold features. We further reveal that the classical federated averaging algorithm (FedAvg) and its variants, which randomly select clients to participate and uniformly average their submodel updates, are severely slowed down, because different parameters of the global model are optimized at different speeds. More specifically, the model parameters related to hot (resp., cold) features are updated quickly (resp., slowly).
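The hot/cold differentiation can be illustrated with a small simulation (a hypothetical setup for illustration, not from the paper): under uniform random client sampling, the embedding row of a feature present in 80% of clients' local data receives far more updates than the row of a feature present in only 1%.

```python
import random

random.seed(0)

NUM_CLIENTS = 1000
CLIENTS_PER_ROUND = 50
ROUNDS = 200

# Hypothetical feature popularity: feature 0 is "hot" (appears in 80% of
# clients' local data), feature 1 is "cold" (appears in 1%).
client_features = []
for _ in range(NUM_CLIENTS):
    feats = set()
    if random.random() < 0.80:
        feats.add(0)  # hot feature
    if random.random() < 0.01:
        feats.add(1)  # cold feature
    client_features.append(feats)

# Count how often each embedding row receives an update when clients are
# sampled uniformly at random, as in FedAvg-style client selection.
updates = {0: 0, 1: 0}
for _ in range(ROUNDS):
    for c in random.sample(range(NUM_CLIENTS), CLIENTS_PER_ROUND):
        for f in client_features[c]:
            updates[f] += 1

print(updates)  # the hot row is updated roughly 80x more often than the cold row
```

Since a parameter is only touched by clients whose data involve its feature, the per-round update frequency of an embedding row scales with the feature's popularity, which is exactly the per-parameter speed differentiation described above.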


A Proof of Theorem

Neural Information Processing Systems

A.1 Proof Sketch. We first introduce the following lemma (Lemma 1). In general, it is hard to derive a convergence rate for the objective values. By Theorem 5, we can also show the superiority of FedSubAvg over FedAvg. We then assume that FedSubAvg always activates all the clients at the beginning of each communication round and then uses the parameters maintained by a few selected clients to generate the next-round parameters. It is clear that this update scheme is equivalent to the original one.



A Proof of Theorem 1, A2, B1, B

Neural Information Processing Systems

A.1 Proof Sketch. We first introduce the following lemma (Lemma 1). We first consider the condition number of Ĥ when X is in a locally convex area. In general, it is hard to derive a convergence rate for the objective values. However, when the global model is in a locally convex area of f, we can obtain the relationship between the gradient and the local optimum (Theorem 4, which applies when there is no parameter heat dispersion and X is in a µ-strongly convex area of f). We note that there is a difference between equations 18 and 21: for each client i, equation 18 involves all the parameters of the full model, while equation 21 involves only part of the parameters of the submodel. This changes the lower bound of T(Y) and further leads to a change in the conclusion.



Federated Submodel Averaging

Ding, Yucheng; Niu, Chaoyue; Wu, Fan; Tang, Shaojie; Lv, Chengfei; Feng, Yanghe; Chen, Guihai

arXiv.org Artificial Intelligence

We study practical data characteristics underlying federated learning, where non-i.i.d. data from clients have sparse features, and a certain client's local data normally involve only a small part of the full model, called a submodel. Due to data sparsity, the classical federated averaging (FedAvg) algorithm or its variants will be severely slowed down, because when updating the global model, each client's zero update to the part of the full model outside its submodel is inaccurately aggregated. Therefore, we propose federated submodel averaging (FedSubAvg), which ensures that the expected global update of each model parameter equals the average of the local updates of the clients who involve it. We theoretically prove the convergence rate of FedSubAvg by deriving an upper bound under a new metric called the element-wise gradient norm. In particular, this new metric can characterize the convergence of federated optimization over sparse data, while the conventional metric of squared gradient norm used in FedAvg and its variants cannot. We extensively evaluate FedSubAvg over both public and industrial datasets. The evaluation results demonstrate that FedSubAvg significantly outperforms FedAvg and its variants.
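The aggregation rule described above can be sketched in plain Python. This is a minimal single-round illustration under stated assumptions, not the paper's implementation: each client reports a sparse update as a dict over the parameter indices it involves, and the function names are illustrative.

```python
def fedavg(updates, num_params):
    """Uniform averaging over all clients: the implicit zero updates from
    non-involved clients dilute the aggregate for cold parameters."""
    agg = [0.0] * num_params
    for u in updates:
        for k, v in u.items():
            agg[k] += v
    return [x / len(updates) for x in agg]

def fedsubavg(updates, num_params):
    """Average each parameter only over the clients that involve it, so the
    global update per parameter matches the mean of the relevant local updates."""
    agg = [0.0] * num_params
    count = [0] * num_params
    for u in updates:
        for k, v in u.items():
            agg[k] += v
            count[k] += 1
    return [a / c if c else 0.0 for a, c in zip(agg, count)]

# Three clients; parameter 0 is hot (all involve it), parameter 1 is cold
# (only one client involves it).
updates = [{0: 1.0}, {0: 1.0}, {0: 1.0, 1: 3.0}]
print(fedavg(updates, 2))     # [1.0, 1.0]  cold update diluted by zeros
print(fedsubavg(updates, 2))  # [1.0, 3.0]  cold update kept at full scale
```

In this toy example, uniform averaging shrinks the cold parameter's update by a factor equal to the number of non-involved clients, which is the slowdown mechanism the abstract describes; dividing by the per-parameter involvement count removes it.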