TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
Won, William, Elavazhagan, Midhilesh, Srinivasan, Sudarshan, Durg, Ajaya, Gupta, Swati, Krishna, Tushar
arXiv.org Artificial Intelligence
Collective communications are an indispensable part of distributed training. Running a topology-aware collective algorithm is crucial for optimizing communication performance by minimizing congestion. Today, such algorithms exist only for a small set of simple topologies, which limits the topologies that training clusters can employ and makes it difficult to handle irregular topologies caused by network failures. In this paper, we propose TACOS, an automated topology-aware collective synthesizer for arbitrary input network topologies. TACOS synthesized an All-Reduce algorithm that is 3.73x faster than baselines, and synthesized collective algorithms for a 512-NPU system in just 6.1 minutes.
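To make the term "topology-aware collective algorithm" concrete, the following is a minimal Python sketch of the well-known ring All-Reduce, simulated in a single process. It is not the TACOS synthesizer and is not taken from the paper; it only illustrates an algorithm tailored to one simple topology (a unidirectional ring). The function name `ring_all_reduce` and the one-element-per-chunk layout are assumptions made for illustration.

```python
def ring_all_reduce(values):
    """Simulate a sum All-Reduce over a unidirectional ring of NPUs.

    values[i] is the input vector of NPU i. For simplicity (an assumption
    of this sketch), each vector has exactly n elements, one per chunk.
    Returns the per-NPU vectors after the collective completes.
    """
    n = len(values)
    assert all(len(v) == n for v in values), "one chunk per NPU expected"
    data = [list(v) for v in values]

    # Phase 1, Reduce-Scatter: in each of n-1 steps, NPU i forwards one
    # chunk to its ring neighbor (i+1), which accumulates it.
    for step in range(n - 1):
        # Snapshot all sends first so the step behaves as simultaneous.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, val in sends:
            data[(src + 1) % n][chunk] += val

    # NPU i now holds the fully reduced chunk (i+1) % n.

    # Phase 2, All-Gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, val in sends:
            data[(src + 1) % n][chunk] = val

    return data


if __name__ == "__main__":
    inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(inputs))
```

Every transfer in this sketch uses only the link from NPU i to NPU i+1, so no link carries more than one chunk per step on a physical ring. On a different or irregular topology the same schedule could map several transfers onto one link, which is the congestion problem the abstract describes.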
Apr-11-2023
- Country:
- Asia > India
- Karnataka (0.14)
- North America > United States
- Minnesota > Hennepin County > Minneapolis (0.14)
- Genre:
- Research Report (0.50)
- Industry:
- Transportation (0.68)
- Technology: