Communication Optimization for Decentralized Learning atop Bandwidth-limited Edge Networks

Tingyang Sun, Tuan Nguyen, Ting He

arXiv.org Artificial Intelligence 

Abstract: Decentralized federated learning (DFL) is a promising machine learning paradigm for bringing artificial intelligence (AI) capabilities to the network edge. Running DFL on top of edge networks, however, faces severe performance challenges due to the extensive parameter exchanges between agents. Most existing solutions for these challenges are based on simplistic communication models, which cannot capture the case of learning over a multi-hop bandwidth-limited network. In this work, we address this problem by jointly designing the communication scheme for the overlay network formed by the agents and the mixing matrix that controls the communication demands between the agents. By carefully analyzing the properties of our problem, we cast each design problem into a tractable optimization and develop an efficient algorithm with guaranteed performance. Our evaluations based on real topology and data show that the proposed algorithm can reduce the total training time by over 80% compared to the baseline without sacrificing accuracy, while significantly improving computational efficiency over the state of the art.

INTRODUCTION

Decentralized federated learning (DFL) [1] is an emerging machine learning paradigm that allows multiple learning agents to collaboratively learn a shared model from their local data without directly sharing the data. In contrast to the centralized federated learning (FL) paradigm [2], DFL eliminates the parameter server by letting the learning agents directly exchange model updates with their neighbors through peer-to-peer connections; the received updates are then aggregated locally [3]. Since its introduction, DFL has attracted significant attention for its robustness against single points of failure and its ability to balance the communication load across nodes without increasing the computational complexity [1]. Meanwhile, DFL still faces significant performance challenges due to the extensive data transfer between agents.
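To make the role of the mixing matrix concrete, the following sketch simulates one ingredient of DFL: agents repeatedly average their model parameters with their neighbors' according to a doubly stochastic mixing matrix. The 4-agent ring topology, the Metropolis-Hastings weighting rule, and the toy scalar "models" are illustrative assumptions, not the paper's algorithm or its optimized matrix design.

```python
import numpy as np

# Hypothetical 4-agent ring overlay (illustrative only; not from the paper).
# adjacency[i][j] = 1 iff agents i and j have a peer-to-peer connection.
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

def metropolis_weights(adj):
    """Build a symmetric, doubly stochastic mixing matrix from a graph
    using Metropolis-Hastings weights, a common textbook choice for
    decentralized averaging."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # self-weight absorbs the remainder
    return W

W = metropolis_weights(adjacency)

# Each row of X is one agent's (toy, one-dimensional) model parameters.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# One gossip round: each agent replaces its model with a weighted average
# of its own and its neighbors' models. Repeated rounds drive every agent
# toward the global average (2.5 here), which is the consensus behavior
# DFL aggregation relies on.
for _ in range(50):
    X = W @ X
```

A nonzero entry W[i, j] creates a communication demand between agents i and j each round, which is why the choice of mixing matrix directly shapes the traffic imposed on the underlying network.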