DropCompute: simple and more robust distributed synchronous training via compute variance reduction

Giladi, Niv, Gottlieb, Shahar, Shkolnik, Moran, Karnieli, Asaf, Banner, Ron, Hoffer, Elad, Levy, Kfir Yehuda, Soudry, Daniel

Sep-24-2023–arXiv.org Artificial Intelligence

Background: Distributed training is essential for large-scale training of deep neural networks (DNNs). The dominant methods for large-scale DNN training are synchronous (e.g., All-Reduce), but they require waiting for all workers at each step and are therefore limited by the delays that straggling workers cause. Results: We study a typical scenario in which workers straggle because of variability in compute time. We find an analytical relation between compute-time properties and the scalability limitations caused by such straggling workers. Building on these findings, we propose a simple yet effective decentralized method that reduces the variation among workers and thus improves the robustness of synchronous training. The method can be integrated with the widely used All-Reduce. We validate our findings on large-scale training tasks using 200 Gaudi accelerators.
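The straggler effect described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's analysis or implementation: per-worker compute times are assumed to be independent Gaussian draws, and the "capped" variant assumes each worker stops computing at a fixed time threshold, which is one plausible reading of "reducing the variation among workers" that the abstract does not spell out. The function names and all parameter values are hypothetical.

```python
import random


def sample_compute_times(rng, num_workers, mean, std):
    """Draw one compute time per worker (Gaussian model is an assumption)."""
    return [max(0.0, rng.gauss(mean, std)) for _ in range(num_workers)]


def mean_sync_step_time(num_workers, num_steps=2000, mean=1.0, std=0.1, seed=0):
    """Average step time when every step waits for the slowest worker."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        times = sample_compute_times(rng, num_workers, mean, std)
        total += max(times)  # a synchronous step ends when the last worker ends
    return total / num_steps


def mean_capped_step_time(num_workers, threshold, num_steps=2000,
                          mean=1.0, std=0.1, seed=0):
    """Same simulation, but each worker stops at `threshold` (hypothetical cap)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        times = sample_compute_times(rng, num_workers, mean, std)
        total += max(min(t, threshold) for t in times)
    return total / num_steps


if __name__ == "__main__":
    for n in (4, 64, 200):
        plain = mean_sync_step_time(n)
        capped = mean_capped_step_time(n, threshold=1.1)
        print(f"workers={n:4d}  sync={plain:.3f}  capped={capped:.3f}")
```

In this toy model the average synchronous step time grows with the number of workers even though each worker's mean compute time is unchanged, because the step waits for the maximum. Capping the per-worker time bounds the step time at the threshold, at the cost of some workers not finishing their compute; the simulation does not model that cost.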

batch size, compute variance, dropcompute, (15 more...)

Country:
- North America > United States
  - Virginia (0.04)
  - Texas > Travis County
    - Austin (0.04)
  - New York > New York County
    - New York City (0.04)
- Europe
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > Middle East
  - Israel (0.04)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
