Trustworthiness of Stochastic Gradient Descent in Distributed Learning

Li, Hongyang, Wu, Caesar, Chadli, Mohammed, Mammar, Said, Bouvry, Pascal

arXiv.org Artificial Intelligence 

Distributed learning (DL) accelerates the training of deep learning models by distributing training tasks across multiple computing nodes [1]. However, as data scales continue to grow, the size of model gradients grows accordingly; consider, for example, training deep learning models on ImageNet [2], which contains over 14 million labeled images spanning approximately 22,000 categories, a scale at which communication efficiency becomes a constraint [3]. Gradient compression reduces the communication overhead of transmitting gradients between nodes and thereby improves system computational efficiency [4, 5, 6]; it has thus emerged as an effective optimization technique in distributed learning, especially when training complex models on large-scale data. Among the various gradient compression techniques, PowerSGD [6] and Top-K SGD [7] stand out as prominent solutions for their ability to substantially reduce communication costs while preserving scalability and model accuracy in large-scale distributed learning. These two algorithms are particularly suitable for our study because they represent the two fundamental approaches to gradient compression: PowerSGD uses low-rank approximation, while Top-K SGD uses sparsification, retaining only the largest-magnitude gradient entries. Both techniques are widely recognized for their practical effectiveness, especially when combined, to varying extents, with advanced features such as error feedback, warm start, and all-reduce, making them ideal candidates of compressed SGD for assessing privacy risks in distributed deep learning systems.
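To make the contrast between the two compression families concrete, the following sketch (not from the paper; a minimal single-worker illustration using NumPy) shows the core of each: Top-K sparsification keeps only the k largest-magnitude entries of a gradient, while a PowerSGD-style step compresses a gradient matrix into a rank-r factorization via one power iteration with QR orthonormalization. Function names are illustrative; the real algorithms additionally apply error feedback, warm-started factors, and all-reduce of the intermediate matrices across workers.

```python
import numpy as np

def topk_compress(grad, k):
    """Top-K sparsification: keep the k largest-magnitude entries of a gradient."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    return idx, flat[idx]

def topk_decompress(idx, values, shape):
    """Rebuild a dense (mostly zero) gradient from the retained entries."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = values
    return out.reshape(shape)

def powersgd_compress(grad_matrix, rank, q=None):
    """One PowerSGD-style power-iteration step: low-rank factors P (m x r), Q (n x r).

    In distributed training, P and Q would each be all-reduced across workers;
    here we show the single-worker linear algebra only."""
    _, n = grad_matrix.shape
    if q is None:
        q = np.random.default_rng(1).standard_normal((n, rank))  # warm start would reuse q
    p = grad_matrix @ q          # project onto the current rank-r subspace
    p, _ = np.linalg.qr(p)       # orthonormalize the left factor
    q = grad_matrix.T @ p        # update the right factor
    return p, q

def powersgd_decompress(p, q):
    """Low-rank reconstruction of the gradient matrix."""
    return p @ q.T
```

Both compressors are lossy; the discarded residual (dense entries dropped by Top-K, or the component of the gradient outside the rank-r subspace) is what error feedback accumulates and re-injects at the next step.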