Straggler Mitigation
Straggler Mitigation in Distributed Optimization Through Data Encoding
Karakus, Can, Sun, Yifan, Diggavi, Suhas, Yin, Wotao
Slow-running or straggler tasks can significantly reduce computation speed in distributed systems. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling by embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternative approach in which we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to the encoding. We propose several encoding schemes and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample-path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, the approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data-replication strategies.
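As a rough illustration of the encoding idea, here is a minimal NumPy sketch, not the paper's exact construction: the Gaussian encoding matrix S, the redundancy factor, and the straggler pattern are all illustrative assumptions. The data is encoded once as SX and Sy, and plain gradient descent then runs on the encoded problem using only whichever workers respond in each iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10, 8        # samples, features, workers (assumed)
beta = 2.0                  # redundancy factor (assumed)

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# One-time encoding: S is tall, so the encoded data carries redundancy.
S = rng.standard_normal((int(beta * n), n)) / np.sqrt(beta * n)
Xe, ye = S @ X, S @ y

# Partition the encoded rows across the m workers.
blocks = np.array_split(np.arange(Xe.shape[0]), m)

w = np.zeros(d)
step = 1.0 / (np.linalg.norm(Xe, 2) ** 2 / m)  # rough 1/L for the averaged gradient
for _ in range(500):
    # Simulate stragglers: each iteration only m - 2 workers respond,
    # and the update proceeds obliviously on whatever arrives.
    fast = rng.choice(m, size=m - 2, replace=False)
    g = sum(Xe[blocks[i]].T @ (Xe[blocks[i]] @ w - ye[blocks[i]]) for i in fast)
    w -= step * g / len(fast)

print("residual on the original problem:", np.linalg.norm(X @ w - y))
```

Note the division of labor this buys: encoding is a one-time preprocessing cost, while the iteration loop is ordinary gradient descent that never needs to know which workers straggled.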
Reviews: Straggler Mitigation in Distributed Optimization Through Data Encoding
This paper addresses the problem of performing distributed optimization in the presence of straggling/slow computation units. In particular, the paper focuses on the linear regression problem
$$\min_w \|Xw - y\|_2, \qquad (1)$$
where $X = [X_1^T, X_2^T, \ldots, X_m^T]^T$ and $y = [y_1, y_2, \ldots, y_m]^T$ denote the data points and the corresponding labels. In general, the distributed setup with $m$ worker nodes allocates the $i$-th data point $X_i$ and the associated label $y_i$ to the $i$-th worker node. The linear regression problem is then solved in an iterative manner, with messages/information communicated between the (master) server and the worker nodes. In practice, however, some of the workers (aka stragglers) take longer to complete their share of the processing/communication, which slows down the entire distributed optimization process.
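To make the baseline concrete: in the uncoded setup the review describes, worker $i$ computes the partial gradient $X_i^T(X_i w - y_i)$ of (1) on its own shard, and the master sums all $m$ contributions before updating, so a single straggler delays the whole step. A minimal sketch, where shard sizes, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 5                                  # workers, features (assumed)
shards = [rng.standard_normal((50, d)) for _ in range(m)]  # X_i per worker
w_true = rng.standard_normal(d)
labels = [Xi @ w_true for Xi in shards]      # y_i per worker (noiseless)

w = np.zeros(d)
for _ in range(300):
    # Worker i computes X_i^T (X_i w - y_i); the master sums all m partial
    # gradients, so the step completes only after the slowest worker replies.
    partials = [Xi.T @ (Xi @ w - yi) for Xi, yi in zip(shards, labels)]
    w -= 0.002 * sum(partials)

print("recovered w:", np.allclose(w, w_true, atol=1e-6))
```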
Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning
Maßny, Luis, Hofmeister, Christoph, Egger, Maximilian, Bitar, Rawad, Wachter-Zeh, Antonia
We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. To mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from the non-straggling workers alone. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and higher latency. We propose a gradient coding scheme that tolerates a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. Through proper task scheduling and a small amount of additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of the proposed scheme and show that it is significantly lower than that of fixed-tolerance gradient codes.
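For reference, the fixed-tolerance gradient codes this abstract builds on can be illustrated with the classic 3-worker, 1-straggler construction of Tandon et al. (this sketch shows plain gradient coding, not the paper's nested concatenation; the encoding matrix below is the standard textbook example): each worker computes two of three partial gradients and transmits one linear combination, and the master recovers the full gradient from any two workers.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
# Three partial gradients g1, g2, g3 (stand-ins for per-shard gradients).
G = rng.standard_normal((3, d))
full = G.sum(axis=0)

# Encoding matrix from the 3-worker, 1-straggler example of Tandon et al.:
# worker i computes the shards with nonzero coefficients in row i.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])
sends = B @ G                     # what each worker transmits

for straggler in range(3):
    alive = [i for i in range(3) if i != straggler]
    # Find a combination a with a @ B[alive] = (1, 1, 1), i.e. the full sum.
    a, *_ = np.linalg.lstsq(B[alive].T, np.ones(3), rcond=None)
    recovered = a @ sends[alive]
    assert np.allclose(recovered, full)

print("full gradient recovered from any 2 of 3 workers")
```

The nested scheme in this paper concatenates codes of this kind built for different straggler counts, so the tolerated number of stragglers no longer has to be fixed in advance.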