Sageflow: Robust Federated Learning against Both Stragglers and Adversaries (Supplementary Material)

Neural Information Processing Systems

The hyperparameter settings for Sageflow are shown in Table 1 and Table 2. Backdoor attack: the hyperparameter details are shown in Table 4 (hyperparameters for Sageflow with both stragglers and adversaries under the backdoor attack: dataset, γ, λ, δ, E); we specify these values in Table 5. The local batch size is set to 64. Figure 1 shows the performance under the non-scaled backdoor attack with only adversaries (no stragglers), as well as the case with both stragglers and adversaries. Additional experiments were conducted under model poisoning with a scale factor of 10. The loss associated with a poisoned device increases as the scale factor grows from 0.1 to 10. Not only Sageflow but also Zeno+ can effectively defend against attacks with only adversaries.
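The observation that a poisoned device's loss grows with the scale factor suggests loss-based reweighting of client updates. Below is a minimal, hypothetical sketch of such reweighting; the function name, the inverse-power weighting form, and the use of δ as its exponent are our assumptions, not necessarily the paper's exact rule:

```python
import numpy as np

def loss_weighted_average(updates, losses, delta=1.0):
    """Hypothetical loss-based reweighting sketch: updates from devices
    whose models show high loss (e.g., on public data at the server) are
    down-weighted via inverse-loss weights w_i proportional to 1 / loss_i**delta.
    The exact weighting in Sageflow may differ."""
    w = 1.0 / np.power(np.asarray(losses, dtype=float), delta)
    w = w / w.sum()  # normalize weights to sum to 1
    return sum(wi * u for wi, u in zip(w, updates))
```

With two updates and losses 1.0 and 9.0, the low-loss update receives weight 0.9 and dominates the aggregate, illustrating how a heavily scaled (and thus high-loss) poisoned update is suppressed.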


Sageflow: Robust Federated Learning against Both Stragglers and Adversaries

Neural Information Processing Systems

While federated learning (FL) allows efficient model training with local data at edge devices, two major issues remain to be resolved: slow devices, known as stragglers, and malicious attacks launched by adversaries.


FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout

Neural Information Processing Systems

Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also creates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, "straggler" devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy.
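The threshold-based sub-model extraction described above can be sketched as follows. This is a minimal, hypothetical illustration: the function name and the quantile-based choice of threshold are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def invariant_dropout_mask(prev_weights, new_weights, drop_fraction=0.25):
    """Hypothetical sketch: weights whose recent updates changed the least
    are treated as 'invariant' and dropped from the sub-model sent to a
    straggler, reducing its training load."""
    delta = np.abs(new_weights - prev_weights)
    threshold = np.quantile(delta, drop_fraction)  # assumed thresholding rule
    return delta > threshold  # True = keep the weight in the sub-model
```

A straggler would then train only the masked subset of weights, while faster devices train the full model.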


Coded Distributed Computing for Inverse Problems

Neural Information Processing Systems

Computationally intensive distributed and parallel computing is often bottlenecked by a small set of slow workers known as stragglers. In this paper, we utilize the emerging idea of "coded computation" to design a novel error-correcting-code-inspired technique for solving linear inverse problems under specific iterative methods in a parallelized implementation affected by stragglers. Example machine-learning applications include inverse problems such as personalized PageRank and sampling on graphs. We provably show that our coded-computation technique can reduce the mean-squared error under a computational deadline constraint. In fact, the ratio of the mean-squared errors of replication-based and coded techniques diverges to infinity as the deadline increases. Our experiments for personalized PageRank, performed on real systems and real social networks, show that this ratio can be as large as 10^4. Further, unlike coded-computation techniques proposed thus far, our strategy combines the outputs of all workers, including the stragglers, to produce more accurate estimates at the computational deadline. This also ensures that accuracy degrades gracefully when the number of stragglers is large.
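The basic coded-computation idea can be illustrated with a classic erasure-coded matrix-vector multiply. This is a hedged sketch under our own assumptions, not the paper's iterative scheme (which additionally combines the stragglers' partial outputs): the matrix is split into two row blocks, a third "parity" worker computes the sum block, and the result survives the loss of any one worker.

```python
import numpy as np

def coded_matvec(A, x, straggler):
    """Illustrative sketch: compute A @ x with 3 workers so that any one
    straggler can be ignored. A must have an even number of rows."""
    k = A.shape[0] // 2
    A1, A2 = A[:k], A[k:]
    # In a real system each task runs on a different machine; here we just
    # compute all three and then discard the straggler's output.
    tasks = {0: A1 @ x, 1: A2 @ x, 2: (A1 + A2) @ x}  # worker 2 holds parity
    done = {w: r for w, r in tasks.items() if w != straggler}
    if 0 in done and 1 in done:
        return np.concatenate([done[0], done[1]])
    if 0 in done:  # recover A2 @ x from the parity result
        return np.concatenate([done[0], done[2] - done[0]])
    return np.concatenate([done[2] - done[1], done[1]])
```

Replication would need two copies of each block to tolerate one straggler; the single parity block achieves the same tolerance with less redundancy.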


Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products

Neural Information Processing Systems

Faced with the saturation of Moore's law and the increasing size and dimension of data, system designers have increasingly resorted to parallel and distributed computing to reduce the computation time of machine-learning algorithms. However, distributed computing is often bottlenecked by a small fraction of slow processors called stragglers, which reduce the speed of computation because the fusion node has to wait for all processors to complete their processing. To combat the effect of stragglers, recent literature proposes introducing redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique, which we call Short-Dot, that introduces redundant computations in a coding-theory-inspired fashion for computing linear transforms of long vectors. Instead of computing the long dot products required in the original linear transform, we construct a larger number of redundant, short dot products that can be computed more efficiently at individual processors. Further, only a subset of these short dot products is required at the fusion node to finish the computation successfully. We demonstrate through probabilistic analysis as well as experiments on computing clusters that Short-Dot offers significant speed-ups over existing techniques. We also derive trade-offs between the length of the dot products and the resilience to stragglers (the number of processors required to finish) for any such strategy, and compare these to the trade-off achieved by our strategy.
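The "any subset of processors suffices" property behind such coded linear transforms can be sketched with a Vandermonde-coded construction. This is an illustration under our own assumptions: it shows only the erasure-code aspect, and does not reproduce Short-Dot's defining construction of sparse ("short") encoded rows.

```python
import numpy as np
from itertools import combinations

def encode_rows(A, P):
    """Encode the m rows of A into P coded rows via a Vandermonde
    generator, so the dot products of ANY m coded rows with x suffice
    to recover A @ x. (Short-Dot additionally makes coded rows sparse,
    which this sketch omits.)"""
    m = A.shape[0]
    G = np.vander(np.arange(1, P + 1), m, increasing=True).astype(float)  # P x m
    return G, G @ A  # coded rows, one per processor

def decode(G, results, workers):
    """Recover A @ x from the short-dot results of any m surviving workers."""
    Gs = G[list(workers)]
    b = np.array([results[w] for w in workers])
    return np.linalg.solve(Gs, b)  # solves Gs @ (A @ x) = b
```

Here each processor computes a single dot product of its coded row with x, and the fusion node inverts the corresponding Vandermonde submatrix once enough results arrive.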