allreduce
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > France (0.04)
- North America > United States > California (0.14)
- North America > United States > Illinois (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology (0.47)
- Government > Regional Government (0.46)
- North America > United States > California (0.14)
- North America > United States > Illinois (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology (0.47)
- Government > Regional Government (0.46)
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Gond, Raja, Kwatra, Nipun, Ramjee, Ramachandran
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Li, Shigang, Ben-Nun, Tal, Nadiradze, Giorgi, Di Girolamo, Salvatore, Dryden, Nikoli, Alistarh, Dan, Hoefler, Torsten
Abstract--Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; T ransformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1 Index Terms --stochastic gradient descent, distributed deep learning, decentralized optimization. The introduction of deep learning is one of the most important advancements in science over the past two decades, powering industries from autonomous driving [1] to drug discovery [2]. With the rise of deep neural networks, their training evolved into a computationally-intensive task that consumes as many resources as modern complex high-performance computing problems [3]. As a result, an abundance of research has been conducted into its scaling and distribution [4]. The leading contenders for largest workloads in deep learning are Neural Language Models [5], [6], Deep Reinforcement Learning (RL) [7], [8] and Neural Architecture Search [9]. In these regimes, computation time is measured in thousands of "GPU days", with some utilizing hundreds of accelerators (GPUs, TPUs) for several weeks [7], [10], [11].
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Austria (0.04)
- North America > United States > Illinois (0.04)
- (3 more...)
- Leisure & Entertainment (0.46)
- Transportation > Ground > Road (0.34)
- Information Technology > Robotics & Automation (0.34)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.34)
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Li, Shigang, Ben-Nun, Tal, Di Girolamo, Salvatore, Alistarh, Dan, Hoefler, Torsten
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
- North America > United States > California > San Diego County > San Diego (0.05)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
- (6 more...)
Near-Optimal Sparse Allreduce for Distributed Deep Learning
Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving an scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
Collective Communication Profiling of Modern-day Machine Learning Workloads
Gupta, Jit, Li, Andrew, Banka, Tarun, Cohen, Ariel, Sridhar, T., Yavatkar, Raj
Machine Learning jobs, carried out on large number of distributed high performance systems, involve periodic communication using operations like AllReduce, AllGather, and Broadcast. These operations may create high bandwidth and bursty traffic patterns, leading to network congestion and packet loss, thus impacting the performance of these jobs. Hence it is imperative to analyze these patterns, which can be helpful in provisioning network resources depending on the type of machine learning workloads. In this poster we carry out extensive analysis of the collective communication behavior seen in a wide variety of models (ex. DeepSeek, GPT, Llama, etc.) To achieve this we instrument Nvidia Collective Communication Library logging functionality for richer context about the collectives and workloads. We adjust configuration parameters that influence collective communication behavior, such as parallelism, number of nodes, and model type. This overview presents and discusses some of the results on the collective communication behavior for the open source DeepSeek V3 inferencing model, which includes operation type and count, transfer sizes per operation, and request size distribution. Our analysis shows that it makes sense to rethink current collective communication frameworks and network topologies so as to accommodate the effect of network anomalies on the mentioned workloads.
- Telecommunications > Networks (0.71)
- Information Technology > Networks (0.54)
DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
Qi, Ji, Zhu, WenPeng, Li, Li, Wu, Ming, Wu, YingJun, He, Wu, Gao, Xun, Zeng, Jason, Heinrich, Michael
The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Jiangsu Province (0.04)
EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Ahmed, Ibrahim, Schaefer, Clemens, Tabak, Gil, Vnukov, Denis, Zhang, Zenong, chern, Felix, Yevtushenko, Anatoliy, Davis, Andy
--While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1. 8 speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25 and Gemma 3 12B by 1.1, respectively, with small to negligible impact on quality.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)