FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices using a Computing Power Aware Scheduler
Li, Zilinghan, Chaturvedi, Pranshu, He, Shilan, Chen, Han, Singh, Gagandeep, Kindratenko, Volodymyr, Huerta, E. A., Kim, Kibaek, Madduri, Ravi
Cross-silo federated learning offers a promising solution for collaboratively training robust and generalized AI models without compromising the privacy of local datasets, e.g., in healthcare, finance, and scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency while waiting for straggler clients. Similarly, asynchronous federated learning algorithms experience degradation in convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods caused by straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients.

Federated learning (FL) is a collaborative model training approach in which multiple clients train a global model under the orchestration of a central server (Konečnỳ et al., 2016; McMahan et al., 2017; Yang et al., 2019; Kairouz et al., 2021). FL typically runs two steps iteratively: (i) the server distributes the global model to clients, which train it using their local data; (ii) the server collects the locally trained models and updates the global model by aggregating them. Federated Averaging (FedAvg) (McMahan et al., 2017) is the most popular FL algorithm: each client trains a model on its local data for Q local steps in each training round, after which the orchestration server aggregates all local models by performing a weighted average and sends the updated global model back to all clients for the next round of training. By leveraging training data from multiple clients without explicitly sharing it, FL enables the training of more robust and generalized models while preserving the privacy of client data.
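To make the mechanics concrete, below is a minimal Python sketch (not the authors' implementation) of the two ideas just described: FedAvg's weighted averaging of local models, and a hypothetical compute-power-aware assignment of local steps in the spirit of FedCompass. The function names, the per-client speed estimates, and the q_min/q_max bounds are illustrative assumptions, not the paper's API.

```python
# A minimal sketch, assuming the server holds model parameters as dicts of
# numpy arrays and has rough estimates of each client's training speed.
import numpy as np


def fedavg_aggregate(local_models, num_samples):
    """Weighted-average client parameters by local dataset size (FedAvg).

    local_models: list of dicts mapping parameter name -> np.ndarray
    num_samples:  list of client dataset sizes used as aggregation weights
    """
    total = float(sum(num_samples))
    return {
        name: sum((n / total) * m[name] for m, n in zip(local_models, num_samples))
        for name in local_models[0]
    }


def assign_local_steps(client_speeds, q_min=20, q_max=100):
    """Hypothetical scheduler heuristic: give faster clients more local steps
    so that clients in a group finish at roughly the same wall-clock time.

    client_speeds: estimated training steps per second for each client
                   (assumed known to the server; an illustrative assumption).
    """
    slowest = min(client_speeds)
    # Scale steps proportionally to speed, clipped to the [q_min, q_max] range.
    return [int(np.clip(q_min * (s / slowest), q_min, q_max)) for s in client_speeds]
```

For example, with estimated client speeds of [1.0, 2.0, 4.0] steps per second and q_min = 20, the helper assigns [20, 40, 80] local steps, so each client finishes its round in roughly 20 seconds, matching the grouped near-simultaneous arrival of local models that the abstract describes.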
arXiv.org Artificial Intelligence
Sep-26-2023