Efficient AllReduce with Stragglers
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
arXiv.org Artificial Intelligence
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the slowest GPU to reach the synchronization barrier before the collective (i.e., the straggler). To address this challenge, we propose StragglAR: a parallel algorithm for AllReduce that accelerates distributed training and inference by exploiting natural variation in GPU execution times. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the final GPU reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient algorithms for large GPU clusters, surpassing the lower bound for bandwidth-optimal synchronous AllReduce by leveraging the asymmetry in when GPUs reach the synchronization barrier. On an 8-GPU server, StragglAR provides a 25% speedup over state-of-the-art AllReduce algorithms.
Sep-30-2025
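The abstract describes a two-phase structure: a ReduceScatter among the ready GPUs during the straggler-induced delay, followed by a completion step once the last GPU arrives. Below is a toy, single-process Python sketch of that data-movement semantics only — it is not the paper's actual communication schedule, and all names here are illustrative assumptions.

```python
# Toy sketch of the two-phase idea: while waiting for a straggler, the p-1
# ready ranks run a ReduceScatter over p-1 chunks; once the straggler arrives,
# its data is folded in and the result is all-gathered. Hypothetical helper,
# not the paper's algorithm -- it only models what data ends up where.

def straggler_allreduce(data, straggler):
    """AllReduce (elementwise sum) over data[rank][i], exploiting a straggler."""
    p = len(data)
    ready = [r for r in range(p) if r != straggler]
    k = len(ready)                      # p - 1 ranks reach the barrier early
    n = len(data[0])
    assert n % k == 0, "sketch assumes p-1 divides the vector length"
    c = n // k                          # chunk size

    # Phase 1 (during the straggler-induced delay): ReduceScatter among the
    # ready ranks -- ready rank i ends up owning the partial sum of chunk i.
    owned = []
    for i in range(k):
        s = [0.0] * c
        for r in ready:
            for j in range(c):
                s[j] += data[r][i * c + j]
        owned.append(s)

    # Phase 2 (after the straggler arrives): fold in the straggler's
    # contribution, then AllGather so every rank holds the full sum.
    for i in range(k):
        for j in range(c):
            owned[i][j] += data[straggler][i * c + j]
    full = [x for chunk in owned for x in chunk]
    return [list(full) for _ in range(p)]
```

The intuition for the speedup is that Phase 1 runs "for free" inside the delay that a synchronous AllReduce would spend idle, so only the smaller Phase 2 remains on the critical path after the straggler reaches the barrier.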
- Country:
- Europe
- Germany > Baden-Württemberg
- Stuttgart Region > Stuttgart (0.04)
- Hungary > Budapest
- Budapest (0.04)
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- Poland > Lesser Poland Province
- Kraków (0.04)
- North America > United States
- California > Alameda County
- Livermore (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Genre:
- Research Report (0.64)
- Industry:
- Information Technology (0.69)
- Technology:
- Information Technology
- Architecture > Distributed Systems (0.88)
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.95)
- Natural Language (0.95)
- Representation & Reasoning (1.00)
- Communications > Networks (1.00)
- Graphics (1.00)