 all-to-all communication



NeurIPS2021_emergent_group_communication.pdf

Neural Information Processing Systems

We generate 128,000 images as agents' observations using Python's matplotlib library [Hunter, 2007]. A variational autoencoder [Kingma and Welling, 2014] is used to encode the observations. The input is a flattened 30,720-dimensional vector (32 by 320 by 3). Both the encoder and the decoder have one hidden layer of dimension 1,024. The output (communication message) is a 10-dimensional vector. ReLU is used as the activation function.
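The described architecture is small enough to sketch directly; below is a minimal PyTorch version with the stated sizes (30,720-dimensional flattened input, one 1,024-unit hidden layer on each side, a 10-dimensional message, ReLU activations). The reparameterization step and the exact layer layout are standard-VAE assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class MessageVAE(nn.Module):
    """Sketch: 32x320x3 observations flattened to 30,720 dims, one 1,024-unit
    hidden layer in the encoder and decoder, 10-dimensional latent message."""
    def __init__(self, in_dim=32 * 320 * 3, hidden=1024, msg_dim=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, msg_dim)      # latent mean
        self.logvar = nn.Linear(hidden, msg_dim)  # latent log-variance
        self.dec = nn.Sequential(
            nn.Linear(msg_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim)
        )

    def forward(self, x):
        h = self.enc(x.flatten(start_dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

# The 10-dimensional z (or mu at evaluation time) acts as the communication message.
model = MessageVAE()
recon, mu, logvar = model(torch.randn(4, 32, 320, 3))
```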




Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference

Luo, Shuqing, Li, Pingzhi, Peng, Jie, Wang, Hanrui, Yang, null, Zhao, null, Yu, null, Cao, null, Cheng, Yu, Chen, Tianlong

arXiv.org Artificial Intelligence

Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over $40\%$ of runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them "collaborated", which comprises two cases, intra- and inter-collaboration, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, dubbed Occult. Our designs can either deliver exact results with reduced communication cost or controllably minimize the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than $1.5\times$ speed-up across multiple tasks and models) with comparable or superior quality compared to standard fine-tuning. Code is available at $\href{https://github.com/UNITES-Lab/Occult}{https://github.com/UNITES-Lab/Occult}$.
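To make the intra- vs. inter-collaboration distinction concrete, the sketch below counts, for each token's set of co-activated experts, how many expert pairs land on the same device under a given expert-to-device placement; the routing and placement arrays are hypothetical, and this is only an illustration of the metric, not Occult's implementation.

```python
from itertools import combinations
import numpy as np

def intra_collaboration_fraction(topk_experts, expert_to_device):
    """topk_experts: (num_tokens, k) expert ids routed per token.
    expert_to_device: device id hosting each expert.
    Returns the fraction of co-activated expert pairs kept on one device."""
    intra = inter = 0
    for experts in topk_experts:
        for e1, e2 in combinations(set(experts.tolist()), 2):
            if expert_to_device[e1] == expert_to_device[e2]:
                intra += 1
            else:
                inter += 1
    total = intra + inter
    return intra / total if total else 0.0

# Hypothetical routing: 6 tokens, top-2 of 8 experts spread over 4 devices.
routing = np.random.randint(0, 8, size=(6, 2))
placement = np.arange(8) // 2  # experts 0-1 on device 0, 2-3 on device 1, ...
print(intra_collaboration_fraction(routing, placement))
```

Raising this fraction means more of a token's expert computation stays on one device, which is exactly what reduces the all-to-all traffic of expert parallelism.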


BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

Jin, Zewen, Wang, Shengnan, Zhu, Jiaan, Zhan, Hongrui, Bai, Youhui, Zhang, Lin, Ming, Zhenyu, Li, Cheng

arXiv.org Artificial Intelligence

The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, the fine-grained DeepSeekMoE structure was proposed, which can further improve the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically contains and activates more experts, hence contributing to heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but has high communication efficiency. The innovation of BigMac mainly lies in abandoning the \textbf{c}ommunicate-\textbf{d}escend-\textbf{a}scend-\textbf{c}ommunicate (CDAC) manner used by fine-grained MoE, which forces the All-to-All communication to always take place at the highest dimension. Instead, BigMac adopts an efficient \textbf{d}escend-\textbf{c}ommunicate-\textbf{c}ommunicate-\textbf{a}scend (DCCA) manner. Specifically, we add a descending and an ascending projection at the entrance and exit of the expert, respectively, which enables the communication to be performed at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of the small experts, ensuring that each expert in BigMac has enough capacity to process tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09$\times$ for training and increases the throughput by up to 3.11$\times$ for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.
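One way to read the DCCA idea is as a dimensionality argument: if tokens are projected down to the small expert dimension before the dispatch/combine all-to-all rather than after it, the bytes on the wire shrink roughly by d_model / d_expert. The sketch below only illustrates that ratio with hypothetical sizes; it is not BigMac's actual structure.

```python
def all_to_all_bytes(num_tokens, dim, bytes_per_elem=2):
    """Bytes a rank exchanges to dispatch and later combine its tokens."""
    return 2 * num_tokens * dim * bytes_per_elem  # dispatch + combine

# Hypothetical sizes: model dimension 4096, descended per-expert dimension 512.
d_model, d_expert, tokens = 4096, 512, 8192

cdac = all_to_all_bytes(tokens, d_model)   # CDAC: communicate at full dimension
dcca = all_to_all_bytes(tokens, d_expert)  # DCCA: descend first, then communicate
print(f"CDAC: {cdac / 2**20:.0f} MiB, DCCA: {dcca / 2**20:.0f} MiB, "
      f"reduction: {cdac / dcca:.1f}x")
```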


mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training

Liao, Xudong, Sun, Yijun, Tian, Han, Wan, Xinchen, Jin, Yilun, Wang, Zilong, Ren, Zhenghang, Huang, Xinyang, Li, Wenxue, Tse, Kin Fai, Zhong, Zhizhen, Liu, Guyue, Zhang, Ying, Ye, Xiaofeng, Zhang, Yiming, Chen, Kai

arXiv.org Artificial Intelligence

Mixture-of-Expert (MoE) models outperform conventional models by selectively activating different subnets, named \emph{experts}, on a per-token basis. This gated computation generates dynamic communication that cannot be determined beforehand, challenging existing GPU interconnects, which remain \emph{static} during the distributed training process. In this paper, we advocate for a first-of-its-kind system, called mFabric, that unlocks topology reconfiguration \emph{during} distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has \emph{strong locality}, alleviating the requirement of global reconfiguration. Based on this, we design and implement a \emph{regionally reconfigurable high-bandwidth domain} on top of existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We have built a fully functional mFabric prototype with commodity hardware and a customized collective communication runtime that trains state-of-the-art MoE models with \emph{in-training} topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that mFabric delivers performance comparable to the non-blocking fat-tree fabric while boosting the training cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2$\times$--1.5$\times$ and 1.9$\times$--2.3$\times$ at 100 Gbps and 400 Gbps link bandwidths, respectively.
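To make the locality claim concrete, a back-of-the-envelope check (not mFabric's measurement code) is to take a token-dispatch trace and an assignment of GPUs to reconfigurable regions and compute how much expert-parallel traffic stays inside a region; the trace and region layout below are hypothetical.

```python
import numpy as np

def regional_traffic_fraction(src_gpu, dst_gpu, gpu_to_region):
    """src_gpu/dst_gpu: per-dispatch source and destination GPU ids.
    gpu_to_region: region id of every GPU. Returns the intra-region share."""
    same_region = gpu_to_region[src_gpu] == gpu_to_region[dst_gpu]
    return same_region.mean()

# Hypothetical trace: 10,000 token dispatches across 32 GPUs in 4 regions of 8.
rng = np.random.default_rng(0)
src = rng.integers(0, 32, 10_000)
dst = rng.integers(0, 32, 10_000)
regions = np.arange(32) // 8
print(regional_traffic_fraction(src, dst, regions))
```

A fraction well above the uniform baseline (here 0.25 for 4 equal regions) is the kind of locality that lets a regionally reconfigurable domain stand in for global reconfiguration.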


Decentralized Input and State Estimation for Multi-agent System with Dynamic Topology and Heterogeneous Sensor Network

Wu, Zida, Mehta, Ankur

arXiv.org Artificial Intelligence

A crucial challenge in decentralized systems is state estimation in the presence of unknown inputs, particularly within heterogeneous sensor networks with dynamic topologies. While numerous consensus algorithms have been introduced, they often require extensive information exchange or multiple communication iterations to ensure estimation accuracy. This paper proposes an efficient algorithm that achieves an unbiased and optimal solution comparable to filters with full information about other agents. This is accomplished through the use of information filter decomposition and the fusion of inputs via covariance intersection. Our method requires only a single communication iteration for exchanging individual estimates between agents, instead of multiple rounds of information exchange, thus preserving agents' privacy by avoiding the sharing of explicit observations and system equations. Furthermore, to address the challenges posed by dynamic communication topologies, we propose two practical strategies to handle issues arising from intermittent observations and incomplete state estimation, thereby enhancing the robustness and accuracy of the estimation process. Experiments and ablation studies conducted in both stationary and dynamic environments demonstrate the superiority of our algorithm over other baselines. Notably, it performs as well as, or even better than, algorithms that have a global view of all neighbors.
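Covariance intersection itself is a standard fusion rule, so it can be sketched independently of the paper's decentralized filter: two estimates with unknown cross-correlation are fused as $P^{-1} = \omega P_1^{-1} + (1-\omega) P_2^{-1}$, with $\omega$ chosen to minimize, e.g., the trace of the fused covariance. The NumPy sketch below uses a naive grid search for $\omega$ and is only an illustration of this rule, not the paper's full algorithm.

```python
import numpy as np

def covariance_intersection(x1, P1, x2, P2, weights=np.linspace(0.01, 0.99, 99)):
    """Fuse (x1, P1) and (x2, P2) without knowing their cross-covariance:
    P^-1 = w*P1^-1 + (1-w)*P2^-1; pick w minimizing trace(P)."""
    best = None
    for w in weights:
        P_inv = w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)
        P = np.linalg.inv(P_inv)
        x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
        if best is None or np.trace(P) < best[2]:
            best = (x, P, np.trace(P))
    return best[0], best[1]

# Two consistent but possibly correlated estimates of a 2-D state.
x_fused, P_fused = covariance_intersection(
    np.array([1.0, 0.0]), np.diag([1.0, 4.0]),
    np.array([1.2, -0.1]), np.diag([3.0, 1.0]),
)
print(x_fused, np.trace(P_fused))
```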


Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Li, Jialong, Tripathi, Shreyansh, Rastogi, Lakshay, Lei, Yiming, Pan, Rui, Xia, Yiting

arXiv.org Artificial Intelligence

As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments. This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Aurora provides optimal solutions for three cases, and for the remaining NP-hard scenario, it offers a polynomial-time sub-optimal solution with only a 1.07x degradation from the optimal. Aurora is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across various scenarios. Evaluations demonstrate that Aurora significantly accelerates inference, achieving speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments. Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing methods.
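As a small illustration of why transmission ordering matters (a generic bound, not Aurora's scheduler): given a traffic matrix, an all-to-all cannot finish before its busiest sender or receiver has moved all of its bytes, and a good schedule tries to keep every link busy so that the finish time approaches this bound. The cluster size, traffic, and bandwidth below are hypothetical.

```python
import numpy as np

def alltoall_lower_bound(traffic_bytes, link_bandwidth_Bps):
    """traffic_bytes[i, j]: bytes GPU i sends to GPU j.
    Completion time is bounded below by the most loaded sender or receiver."""
    send_load = traffic_bytes.sum(axis=1)  # bytes each GPU must send
    recv_load = traffic_bytes.sum(axis=0)  # bytes each GPU must receive
    return max(send_load.max(), recv_load.max()) / link_bandwidth_Bps

# Hypothetical 8-GPU MoE dispatch over 100 Gbps (12.5 GB/s) links.
rng = np.random.default_rng(1)
traffic = rng.integers(1, 64, size=(8, 8)) * 2**20  # skewed, MiB-scale entries
np.fill_diagonal(traffic, 0)
print(f"lower bound: {alltoall_lower_bound(traffic, 12.5e9) * 1e3:.2f} ms")
```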


Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

Feng, Hao, Zhang, Boyuan, Ye, Fanjiang, Si, Min, Chu, Ching-Hsiang, Tian, Jiannan, Yin, Chunxing, Deng, Summer, Hao, Yuchen, Balaji, Pavan, Geng, Tong, Tao, Dingwen

arXiv.org Artificial Intelligence

Abstract--DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. Deep Learning Recommendation Models (DLRMs) have significantly risen to prominence in both research and industry sectors in recent years. These models integrate sparse input embedding learning with neural network architectures, marking a notable advance over traditional collaborative filtering-based recommendation systems [1]. DLRMs have been successfully implemented in various industry applications, including the product recommendation system by Amazon [2], and they now constitute a significant portion of deep learning applications across multiple industries. DLRMs are uniquely designed to process high-dimensional categorical features, typically represented by one- or multi-hot vectors matching the size of the category, which leads to significant data sparsity. This setup necessitates the use of collective communication primitives for synchronization across all GPUs. Specifically, the partitioning of sparse embedding tables requires nodes to aggregate sparse embedding lookups during forward passes and their corresponding gradients during backward passes. Consequently, all-to-all communication is utilized in both forward and backward passes for synchronizing sparse lookups and gradients, while all-reduce is employed for synchronizing dense/MLP gradients during the backward pass. Exchanging these lookups and gradients across all GPUs during each minibatch iteration adds significant communication overhead: Figure 1 shows that all-to-all communication accounts for more than 60% of the total training time for DLRM on an 8-node, 32 A100 GPU cluster (connected through a Slingshot interconnect).
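To see why the all-to-all term dominates, a rough per-iteration byte count (with hypothetical sizes, not figures from the paper) can compare the embedding exchange against the dense-gradient all-reduce under the usual hybrid-parallel DLRM layout: sharded embedding tables, replicated MLPs.

```python
def dlrm_comm_volumes(batch_per_gpu, num_gpus, num_tables, emb_dim,
                      dense_params, bytes_per_elem=4):
    """Rough per-GPU, per-iteration communication volume in bytes, assuming
    embedding tables are sharded (model parallel) and MLPs replicated (data
    parallel)."""
    # Forward: each GPU fetches one emb_dim vector per table per local sample
    # from remote shards; backward returns the corresponding gradients.
    alltoall = (2 * batch_per_gpu * num_tables * emb_dim * bytes_per_elem
                * (1 - 1 / num_gpus))
    # Dense/MLP gradients are all-reduced (ring all-reduce moves ~2x the bytes).
    allreduce = 2 * dense_params * bytes_per_elem
    return alltoall, allreduce

# Hypothetical configuration loosely in the spirit of open DLRM benchmarks.
a2a, ar = dlrm_comm_volumes(batch_per_gpu=4096, num_gpus=32,
                            num_tables=512, emb_dim=128, dense_params=30e6)
print(f"all-to-all: {a2a / 2**30:.2f} GiB/GPU/iter, "
      f"all-reduce: {ar / 2**30:.2f} GiB/GPU/iter")
```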