Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training

Tehila Dahan, Kfir Y. Levy

arXiv.org Artificial Intelligence 

In modern machine learning (ML), large-scale distributed training systems have emerged as a cornerstone for advancing complex ML tasks. Distributed ML approaches can significantly accelerate the training process, thus facilitating the practical use of larger, more sophisticated models (Zhao et al., 2023). However, as these systems grow in scale and complexity, they become increasingly susceptible to a range of faults and errors. Moreover, distributed ML propels collaborative learning across decentralized data sources, which often differ in distribution, quality, and volume (Bonawitz et al., 2019). For example, data from different geographic locations, devices, or organizations can exhibit considerable variability. This poses a critical challenge: ensuring that the training process is resilient to faults and errors in such distributed and heterogeneous environments. Fault-tolerant training is imperative for maintaining the integrity, accuracy, and reliability of the learned models, especially when the stakes involve critical decision-making based on ML predictions. The Byzantine model (Lamport et al., 2019; Guerraoui et al., 2023) provides a robust framework for devising and analyzing fault-tolerant training in distributed ML, owing to its ability to capture both random and adversarial failures.
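To make the Byzantine setting concrete, the sketch below illustrates one generic robust-aggregation rule, a coordinate-wise median over worker gradients, and contrasts it with the plain mean when a minority of workers report arbitrary values. This is only an illustrative example under the standard Byzantine-worker assumption; it is not the meta-aggregation scheme proposed in this paper, and the function name and toy data are assumptions for demonstration.

```python
import numpy as np

def coordinate_wise_median(worker_gradients):
    """Aggregate worker gradients with a coordinate-wise median.

    Each coordinate's median is unaffected by a minority of arbitrarily
    corrupted (Byzantine) gradients, unlike the plain mean.
    Illustrative helper, not the paper's aggregator.
    """
    stacked = np.stack(worker_gradients, axis=0)  # shape: (n_workers, dim)
    return np.median(stacked, axis=0)

# Toy setting (assumed for illustration): 8 honest workers report gradients
# near the true gradient, 2 Byzantine workers report arbitrary values.
rng = np.random.default_rng(0)
true_grad = np.ones(4)
honest = [true_grad + 0.1 * rng.standard_normal(4) for _ in range(8)]
byzantine = [np.full(4, 1e6), np.full(4, -1e6)]
gradients = honest + byzantine

print("mean  :", np.mean(np.stack(gradients), axis=0))  # dominated by outliers
print("median:", coordinate_wise_median(gradients))     # stays close to true_grad
```

In this toy run the averaged gradient is dragged far from the true value by the two corrupted reports, whereas the coordinate-wise median remains close to it, which is the basic property robust aggregation rules aim to provide in Byzantine-tolerant training.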
