all-reduce
DropCompute: simple and more robust distributed synchronous training via compute variance reduction
Background: Distributed training is essential for large-scale training of deep neural networks (DNNs). The dominant methods for large-scale DNN training are synchronous (e.g., All-Reduce), but they require waiting for all workers in each step; thus, they are limited by delays caused by straggling workers. Results: We study a typical scenario in which workers straggle due to variability in compute time. We find an analytical relation between compute-time properties and the scalability limitations caused by such straggling workers. Based on these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi accelerators.
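A minimal sketch of the idea the abstract describes: each worker stops processing micro-batches once a per-step compute budget is exceeded and contributes whatever gradient it has accumulated, so the synchronous step is not held up by the slowest worker. The budget value, worker counts, and the use of plain NumPy in place of a real All-Reduce are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, micro_batches, dim = 4, 8, 16
budget = 1.2  # per-step compute-time budget (arbitrary units, assumed)

partial_grads, completed = [], []
for w in range(num_workers):
    elapsed, grad, done = 0.0, np.zeros(dim), 0
    for _ in range(micro_batches):
        elapsed += rng.exponential(0.2)   # variable per-micro-batch compute time
        if elapsed > budget:              # budget exhausted: drop remaining work
            break
        grad += rng.normal(size=dim)      # stand-in for this micro-batch's gradient
        done += 1
    partial_grads.append(grad)
    completed.append(done)

# Synchronous aggregation (stands in for All-Reduce): average over the
# micro-batches that were actually completed across all workers.
step_grad = np.sum(partial_grads, axis=0) / max(sum(completed), 1)
print("micro-batches completed per worker:", completed)
```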
Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large on preemptible compute nodes.
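A toy simulation of iterative group averaging in the spirit of the abstract: in each round workers are split into small random groups and average within the group, so every local value converges toward the global mean while no round requires all workers to communicate at once. The group size, round count, and scalar "parameters" are illustrative assumptions, not the Moshpit All-Reduce protocol itself.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, group_size = 16, 4
values = rng.normal(size=num_workers)     # one scalar "parameter" per worker
global_mean = values.mean()               # averaging preserves the global mean

for round_ in range(4):
    order = rng.permutation(num_workers)
    for g in range(0, num_workers, group_size):
        idx = order[g:g + group_size]
        values[idx] = values[idx].mean()  # in-group averaging step
    print(f"round {round_}: max deviation from global mean "
          f"{np.abs(values - global_mean).max():.2e}")
```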
An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Chen, Chuyan, Ma, Chenyang, Li, Zhangxin, He, Yutong, Dong, Yanjie, Yuan, Kun
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
Distributed Training under Packet Loss
Weintraub, Erez, Banner, Ron, Orda, Ariel
State-of-the-art language and vision models are routinely trained across thousands of GPUs, often spanning multiple data-centers, yet today's distributed frameworks still assume reliable connections (e.g., InfiniBand or RoCE). The resulting acknowledgment traffic and retransmissions inflate tail latencies and limit scalability. Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. A principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss has previously been missing. We address this critical gap by introducing a novel distributed training framework capable of operating over unreliable connections, offering unbiased gradient aggregation and bounded parameter drift without modifying model code or optimizers. The key insight is a two-stage defense against missing messages: (i) Unbiased gradient aggregation: each worker reconstructs a consistent gradient estimate from whatever packets arrive, guaranteeing expectation-level correctness; and (ii) Bounded-drift parameter broadcasts: we prove the inter-worker model discrepancy remains O(1) even after arbitrarily many iterations, preventing the unbounded divergence typical of asynchronous setups. Analytical bounds are matched by experiments on the LLAMA2 7B model with 64 GPUs: tolerating 10% random packet loss yields at most 0.8% perplexity change. This work bridges the gap between communication-efficient datacenter protocols and the accuracy and generalization guarantees demanded by modern large-model training, enabling robust, high-throughput learning on commodity or wide-area networks.
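A small illustration of expectation-level correctness under random packet loss: received gradient contributions are reweighted by the inverse arrival probability, so the aggregate matches the loss-free average in expectation. This inverse-probability estimator is one standard construction for unbiasedness and not necessarily the paper's exact reconstruction; the loss rate, worker count, and toy gradients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, p_loss, trials = 64, 8, 0.10, 2000
grads = rng.normal(size=(num_workers, dim))
true_avg = grads.mean(axis=0)             # loss-free average gradient

estimates = []
for _ in range(trials):
    arrived = rng.random(num_workers) > p_loss             # which packets made it
    est = grads[arrived].sum(axis=0) / (1.0 - p_loss) / num_workers
    estimates.append(est)

# Averaged over many trials, the reweighted estimate has (near) zero bias.
empirical = np.mean(estimates, axis=0)
print("max bias over coordinates:", np.abs(empirical - true_avg).max())
```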
Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
Li, Qingyuan, Zhang, Bo, Ye, Liang, Zhang, Yifan, Wu, Wei, Sun, Yerui, Ma, Lin, Xie, Yuchen
The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
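A minimal example of the kind of low-bit compression used to shrink tensor-parallel communication: symmetric per-row int8 quantization before the collective and dequantization after, trading a small reconstruction error for roughly 4x less traffic. The bit width, per-row scaling, and toy tensor are illustrative assumptions, not the Flash Communication scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1024)).astype(np.float32)       # tensor to communicate

scale = np.abs(x).max(axis=1, keepdims=True) / 127.0    # per-row scale factor
q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# q (1 byte/element) plus the tiny scale vector is what goes on the wire,
# instead of 4 bytes/element of fp32.
x_hat = q.astype(np.float32) * scale
print("compression ratio ~", x.nbytes / (q.nbytes + scale.nbytes))
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```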