TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training

Won, William; Elavazhagan, Midhilesh; Srinivasan, Sudarshan; Durg, Ajaya; Gupta, Swati; Krishna, Tushar

arXiv.org Artificial Intelligence 

Collective communications are an indispensable part of distributed training. Running a topology-aware collective algorithm is crucial for optimizing communication performance by minimizing congestion. Today, such algorithms exist only for a small set of simple topologies, which limits the topologies that training clusters can employ and makes it difficult to handle irregular topologies caused by network failures. In this paper, we propose TACOS, an automated topology-aware collective synthesizer for arbitrary input network topologies. TACOS synthesized an All-Reduce algorithm 3.73x faster than baselines, and synthesized collective algorithms for a 512-NPU system in just 6.1 minutes.
