Byzantine-Resilient Stochastic Gradient Descent for Distributed Learning: A Lipschitz-Inspired Coordinate-wise Median Approach

Haibo Yang, Xin Zhang, Minghong Fang, Jia Liu

arXiv.org Machine Learning 

In this work, we consider the resilience of distributed algorithms based on stochastic gradient descent (SGD) in distributed learning with potentially Byzantine attackers, who could send arbitrary information to the parameter server to disrupt the training process. Toward this end, we propose a new Lipschitz-inspired coordinate-wise median approach (LICM-SGD) to mitigate Byzantine attacks. We show that our LICM-SGD algorithm can resist up to half of the workers being Byzantine attackers, while still converging almost surely to a stationary region in non-convex settings. Also, our LICM-SGD method does not require any information about the number of attackers or the Lipschitz constant, which makes it attractive for practical implementations. Moreover, our LICM-SGD method enjoys the optimal O(md) computational time-complexity, in the sense that its time-complexity is the same as that of standard SGD under no attacks. We conduct extensive experiments to show that our LICM-SGD algorithm consistently outperforms existing methods in training multi-class logistic regression and convolutional neural networks on the MNIST and CIFAR-10 datasets. In our experiments, LICM-SGD also achieves a much faster running time thanks to its low computational time-complexity.

Fueled by the rise of machine learning and big data analytics, recent years have witnessed an ever-increasing interest in solving large-scale empirical risk minimization (ERM) problems, a fundamental optimization problem that underpins a wide range of machine learning applications. In the post-Moore's-Law era, however, to sustain the rapidly growing computational power needed for solving large-scale ERM, the only viable solution is to exploit parallelism at and across different spatial scales. Indeed, the recent success of machine learning applications is due in large part to the use of distributed machine learning frameworks (e.g., TensorFlow [1] and others) that exploit the abundance of distributed CPU/GPU resources in large-scale computing clusters.
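To make the aggregation step concrete, below is a minimal NumPy sketch of the coordinate-wise median part of such a parameter-server update: the server collects one stochastic gradient per worker, takes the median of each coordinate (which bounds the influence of a minority of Byzantine workers and costs O(md) time, the same order as averaging), and applies an SGD step. The Lipschitz-inspired candidate-selection rule that distinguishes LICM-SGD is described in the paper itself and is not reproduced here; the function names, learning rate, and toy data below are illustrative assumptions only, not the authors' implementation.

```python
import numpy as np

def coordinate_wise_median(gradients):
    """Robust aggregate: per-coordinate median over the m worker gradients.

    gradients: array of shape (m, d), one stochastic gradient per worker.
    Median selection per coordinate keeps the expected cost at O(m d),
    the same order as simply averaging the gradients.
    """
    return np.median(gradients, axis=0)

def server_step(theta, gradients, lr=0.1):
    """One parameter-server SGD update using the robust aggregate."""
    return theta - lr * coordinate_wise_median(gradients)

# Toy usage (hypothetical data): 10 workers, 4 of them Byzantine and
# sending arbitrary vectors. The coordinate-wise median stays close to
# the honest gradients, so the update is barely affected.
rng = np.random.default_rng(0)
m, d = 10, 5
grads = rng.normal(loc=1.0, scale=0.1, size=(m, d))        # honest gradients near 1
grads[:4] = rng.normal(loc=50.0, scale=10.0, size=(4, d))  # Byzantine gradients
theta = server_step(np.zeros(d), grads)
print(theta)  # each coordinate is near -0.1 despite the outliers
```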
