Efficient AllReduce with Stragglers
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
arXiv.org Artificial Intelligence
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the slowest GPU to reach the synchronization barrier before the collective (i.e., the straggler). To address this challenge, we propose StragglAR: a parallel algorithm for AllReduce that accelerates distributed training and inference by exploiting natural variation in GPU execution times. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the final GPU reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient algorithms for large GPU clusters, surpassing the lower bound for bandwidth-optimal synchronous AllReduce by leveraging the asymmetry in when GPUs reach the synchronization barrier. On an 8-GPU server, StragglAR provides a 25% speedup over state-of-the-art AllReduce algorithms.
Sep-30-2025
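The abstract describes a two-phase structure: a ReduceScatter among the ready GPUs during the straggler-induced delay, followed by a completion step once the last GPU arrives. Below is a toy, single-process Python sketch of that data-movement semantics only — it is not the paper's actual communication schedule, and all names here are illustrative assumptions.

```python
# Toy sketch of the two-phase idea: while waiting for a straggler, the p-1
# ready ranks run a ReduceScatter over p-1 chunks; once the straggler arrives,
# its data is folded in and the result is all-gathered. Hypothetical helper,
# not the paper's algorithm -- it only models what data ends up where.

def straggler_allreduce(data, straggler):
    """AllReduce (elementwise sum) over data[rank][i], exploiting a straggler."""
    p = len(data)
    ready = [r for r in range(p) if r != straggler]
    k = len(ready)                      # p - 1 ranks reach the barrier early
    n = len(data[0])
    assert n % k == 0, "sketch assumes p-1 divides the vector length"
    c = n // k                          # chunk size

    # Phase 1 (during the straggler-induced delay): ReduceScatter among the
    # ready ranks -- ready rank i ends up owning the partial sum of chunk i.
    owned = []
    for i in range(k):
        s = [0.0] * c
        for r in ready:
            for j in range(c):
                s[j] += data[r][i * c + j]
        owned.append(s)

    # Phase 2 (after the straggler arrives): fold in the straggler's
    # contribution, then AllGather so every rank holds the full sum.
    for i in range(k):
        for j in range(c):
            owned[i][j] += data[straggler][i * c + j]
    full = [x for chunk in owned for x in chunk]
    return [list(full) for _ in range(p)]
```

The intuition for the speedup is that Phase 1 runs "for free" inside the delay that a synchronous AllReduce would spend idle, so only the smaller Phase 2 remains on the critical path after the straggler reaches the barrier.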
- Country:
- Europe
- Germany > Baden-Württemberg
- Stuttgart Region > Stuttgart (0.04)
- Hungary > Budapest
- Budapest (0.04)
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- Poland > Lesser Poland Province
- Kraków (0.04)
- North America > United States
- California > Alameda County
- Livermore (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Genre:
- Research Report (0.64)
- Industry:
- Information Technology (0.69)
- Technology:
- Information Technology
- Architecture > Distributed Systems (0.88)
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.95)
- Natural Language (0.95)
- Representation & Reasoning (1.00)
- Communications > Networks (1.00)
- Graphics (1.00)