Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training

Tehila Dahan, Kfir Y. Levy

arXiv.org Artificial Intelligence 

In modern machine learning (ML), large-scale distributed training systems have emerged as a cornerstone for advancing complex ML tasks. Distributed ML approaches can significantly accelerate the training process, thus facilitating the practical use of larger, more sophisticated models (Zhao et al., 2023). However, as these systems grow in scale and complexity, they become increasingly susceptible to a range of faults and errors. Moreover, distributed ML propels collaborative learning across decentralized data sources, which often differ in distribution, quality, and volume (Bonawitz et al., 2019). For example, data from different geographic locations, devices, or organizations can exhibit considerable variability. This poses a critical challenge: ensuring that the training process is resilient to faults and errors in such distributed and heterogeneous environments. Fault-tolerant training is imperative for maintaining the integrity, accuracy, and reliability of the learned models, especially when the stakes involve critical decision-making based on ML predictions. The Byzantine model (Lamport et al., 2019; Guerraoui et al., 2023) provides a robust framework for devising and analyzing fault-tolerant training in distributed ML, owing to its ability to capture both random and adversarial failures.
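To make the Byzantine setting concrete, the sketch below illustrates one generic robust-aggregation rule, a coordinate-wise median over worker gradients, and contrasts it with the plain mean when a minority of workers report arbitrary values. This is only an illustrative example under the standard Byzantine-worker assumption; it is not the meta-aggregation scheme proposed in this paper, and the function name and toy data are assumptions for demonstration.

```python
import numpy as np

def coordinate_wise_median(worker_gradients):
    """Aggregate worker gradients with a coordinate-wise median.

    Each coordinate's median is unaffected by a minority of arbitrarily
    corrupted (Byzantine) gradients, unlike the plain mean.
    Illustrative helper, not the paper's aggregator.
    """
    stacked = np.stack(worker_gradients, axis=0)  # shape: (n_workers, dim)
    return np.median(stacked, axis=0)

# Toy setting (assumed for illustration): 8 honest workers report gradients
# near the true gradient, 2 Byzantine workers report arbitrary values.
rng = np.random.default_rng(0)
true_grad = np.ones(4)
honest = [true_grad + 0.1 * rng.standard_normal(4) for _ in range(8)]
byzantine = [np.full(4, 1e6), np.full(4, -1e6)]
gradients = honest + byzantine

print("mean  :", np.mean(np.stack(gradients), axis=0))  # dominated by outliers
print("median:", coordinate_wise_median(gradients))     # stays close to true_grad
```

In this toy run the averaged gradient is dragged far from the true value by the two corrupted reports, whereas the coordinate-wise median remains close to it, which is the basic property robust aggregation rules aim to provide in Byzantine-tolerant training.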
