AITopics | allreduce

Collaborating Authors

allreduce

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Kraken: InherentlyParallelTransformersFor EfficientMulti-DeviceInference

Neural Information Processing SystemsFeb-7-2026, 23:23:39 GMT

Large Transformer networks are increasingly used in settings where low inference latency is necessary to enable new applications and improve the end-user experience.

justification, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > France (0.04)

Genre: Research Report (0.46)

Industry: Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

Add feedback

Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training

Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, Alexander Schwing

Neural Information Processing SystemsNov-20-2025, 15:22:39 GMT

Recently, distributed training with AllReduce operations gained popularity as well.

communication, gradient, pipe-sgd, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.14)
North America > United States > Illinois (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Industry:

Information Technology (0.47)
Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Gond, Raja, Kwatra, Nipun, Ramjee, Ramachandran

arXiv.org Artificial IntelligenceOct-31-2025

Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2505.11329

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Near-Optimal Sparse Allreduce for Distributed Deep Learning

Li, Shigang, Hoefler, Torsten

arXiv.org Artificial IntelligenceAug-22-2025

Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving an scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).

artificial intelligence, machine learning, top-value, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3503221.3508399

2201.07598

Country:

Europe > Switzerland (0.28)
North America > United States (0.28)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

Li, Shigang, Ben-Nun, Tal, Nadiradze, Giorgi, Di Girolamo, Salvatore, Dryden, Nikoli, Alistarh, Dan, Hoefler, Torsten

arXiv.org Artificial IntelligenceAug-22-2025

Abstract--Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; T ransformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1 Index Terms --stochastic gradient descent, distributed deep learning, decentralized optimization. The introduction of deep learning is one of the most important advancements in science over the past two decades, powering industries from autonomous driving [1] to drug discovery [2]. With the rise of deep neural networks, their training evolved into a computationally-intensive task that consumes as many resources as modern complex high-performance computing problems [3]. As a result, an abundance of research has been conducted into its scaling and distribution [4]. The leading contenders for largest workloads in deep learning are Neural Language Models [5], [6], Deep Reinforcement Learning (RL) [7], [8] and Neural Architecture Search [9]. In these regimes, computation time is measured in thousands of "GPU days", with some utilizing hundreds of accelerators (GPUs, TPUs) for several weeks [7], [10], [11].

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TPDS.2020.3040606

2005.00124

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.46)
Transportation > Ground > Road (0.34)
Information Technology > Robotics & Automation (0.34)
Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

Li, Shigang, Ben-Nun, Tal, Di Girolamo, Salvatore, Alistarh, Dan, Hoefler, Torsten

arXiv.org Artificial IntelligenceAug-22-2025

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.

artificial intelligence, machine learning, opération, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3332466.3374528

1908.04207

Country:

Europe (0.67)
North America > United States > California (0.15)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Collective Communication Profiling of Modern-day Machine Learning Workloads

Gupta, Jit, Li, Andrew, Banka, Tarun, Cohen, Ariel, Sridhar, T., Yavatkar, Raj

arXiv.org Artificial IntelligenceJul-11-2025

Machine Learning jobs, carried out on large number of distributed high performance systems, involve periodic communication using operations like AllReduce, AllGather, and Broadcast. These operations may create high bandwidth and bursty traffic patterns, leading to network congestion and packet loss, thus impacting the performance of these jobs. Hence it is imperative to analyze these patterns, which can be helpful in provisioning network resources depending on the type of machine learning workloads. In this poster we carry out extensive analysis of the collective communication behavior seen in a wide variety of models (ex. DeepSeek, GPT, Llama, etc.) To achieve this we instrument Nvidia Collective Communication Library logging functionality for richer context about the collectives and workloads. We adjust configuration parameters that influence collective communication behavior, such as parallelism, number of nodes, and model type. This overview presents and discusses some of the results on the collective communication behavior for the open source DeepSeek V3 inferencing model, which includes operation type and count, transfer sizes per operation, and request size distribution. Our analysis shows that it makes sense to rethink current collective communication frameworks and network topologies so as to accommodate the effect of network anomalies on the mentioned workloads.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.07117

Genre: Research Report (0.65)

Industry:

Telecommunications > Networks (0.71)
Information Technology > Networks (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

Qi, Ji, Zhu, WenPeng, Li, Li, Wu, Ming, Wu, YingJun, He, Wu, Gao, Xun, Zeng, Jason, Heinrich, Michael

arXiv.org Artificial IntelligenceJun-27-2025

The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.21263

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration

Ahmed, Ibrahim, Schaefer, Clemens, Tabak, Gil, Vnukov, Denis, Zhang, Zenong, chern, Felix, Yevtushenko, Anatoliy, Davis, Andy

arXiv.org Artificial IntelligenceJun-24-2025

--While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1. 8 speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25 and Gemma 3 12B by 1.1, respectively, with small to negligible impact on quality.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.17615

Country: North America > United States (0.15)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)

Add feedback

Lion Cub: Minimizing Communication Overhead in Distributed Lion

Ishikawa, Satoki, Ben-Nun, Tal, Van Essen, Brian, Yokota, Rio, Dryden, Nikoli

arXiv.org Artificial IntelligenceNov-25-2024

Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors are the output of a sign operation, enabling straightforward quantization. However, simply compressing updates for communication and using techniques like majority voting fails to lead to end-to-end speedups due to inefficient communication algorithms and reduced convergence. We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization. Our findings show that quantization techniques adapted to Lion and selective momentum synchronization can significantly reduce communication costs while maintaining convergence. We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion. This highlights Lion's potential as a communication-efficient solution for distributed training.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.16462

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > United States (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback