AITopics | all-reduce operation

all-reduce operation

Tensor-Parallelism with Partially Synchronized Activations

Neural Information Processing Systems, Jun-21-2026, 07:00:40 GMT

Training and inference of Large Language Models (LLMs) with tensor-parallelism requires substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train a 7B parameter CAAT-Net model and show that tensor-parallel communication can be reduced by up to 50% with no significant drop in pretraining accuracy across nearly all evaluated benchmarks. We also experiment with smaller 130M and 1.1B models to show the robustness and scalability of our method. We find that, in some scenarios, validation loss can even improve when reducing communication. Finally, we demonstrate how CAAT-Net accelerates both training and inference workloads across various settings and model sizes.
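To make the idea of "not fully synchronizing activations" concrete, the sketch below simulates tensor-parallel devices that each hold a partial output vector. It compares a full all-reduce with a hypothetical partial variant that sums only a leading fraction of the channels and leaves the rest device-local. The function names, the `sync_fraction` parameter, and the choice of which channels to synchronize are assumptions made for illustration. The abstract does not specify how CAAT-Net selects or handles unsynchronized activations, so this should not be read as the paper's method.

```python
from typing import List

Vector = List[float]


def all_reduce_sum(partials: List[Vector]) -> List[Vector]:
    """Full synchronization: every device receives the element-wise sum."""
    total = [sum(channel) for channel in zip(*partials)]
    return [list(total) for _ in partials]


def partial_all_reduce_sum(partials: List[Vector], sync_fraction: float) -> List[Vector]:
    """Hypothetical partial sync: sum the first k channels, keep the rest local."""
    if not 0.0 <= sync_fraction <= 1.0:
        raise ValueError("sync_fraction must be in [0, 1]")
    width = len(partials[0])
    k = int(width * sync_fraction)  # number of channels that are communicated
    synced = [sum(channel) for channel in zip(*(p[:k] for p in partials))]
    return [synced + list(p[k:]) for p in partials]


# Two simulated devices, four output channels each.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
full = all_reduce_sum(partials)
half = partial_all_reduce_sum(partials, sync_fraction=0.5)
```

With `sync_fraction=0.5`, only half of the channels cross the interconnect, which mirrors the "up to 50%" communication reduction quoted above in spirit only. After the partial step the devices no longer hold identical activations, which is presumably why the abstract mentions adjustments to current training practices.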

large language model, machine learning, natural language, (20 more...)

Country: Asia > Middle East > Israel (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.66)

Industry:

Information Technology (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Tensor-Parallelism with Partially Synchronized Activations

Lamprecht, Itay, Karnieli, Asaf, Hanani, Yair, Giladi, Niv, Soudry, Daniel

arXiv.org Artificial Intelligence, Dec-2-2025

large language model, machine learning, natural language, (18 more...)

2506.19645

Country: Asia > Middle East > Israel (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

A Look Into Training Large Language Models on Next Generation Datacenters

Gherghescu, Alexandru M., Bădoiu, Vlad-Andrei, Agache, Alexandru, Dumitru, Mihai-Valentin, Vasilescu, Iuliu, Mantu, Radu, Raiciu, Costin

arXiv.org Artificial Intelligence, Jul-1-2024

Is it still worth doing computer networking research? What are relevant problems in this space given the supremacy of hyperscalers in deployed large networks? We take an unconventional approach to finding relevant research directions, by starting from Microsoft's plans to build a $100 billion datacenter for ML. Our goal is to understand what models could be trained in such a datacenter, as well as the high-level challenges one may encounter in doing so. We first examine the constraints imposed by cooling and power requirements for our target datacenter and find that it is infeasible to build in a single location. We use LLM scaling laws to determine that we could train models of 50T or 100T. Finally, we examine how distributed training might work for these models, and what the networking requirements are. We conclude that building the datacenter and training such models is technically possible, but this requires a novel NIC-based multipath transport along with a redesign of the entire training stack, outlining a research agenda for our community in the near future.

communication, datacenter, gpus, (12 more...)

2407.12819

Country:

Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.06)
North America > United States > California (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Energy > Power Industry (1.00)
Information Technology (0.66)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Communication Algorithm-Architecture Co-Design for Distributed Deep Learning

#artificialintelligence, Jun-16-2021, 14:56:49 GMT

Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency in widely used all-reduce algorithms, and the opportunity of algorithm-architecture co-design. We propose the MULTITREE all-reduce algorithm with topology and resource utilization awareness for efficient and scalable all-reduce operations, which is applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communications, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of big gradient exchange. We evaluate the co-design using different all-reduce data sizes for synthetic study, demonstrating its effectiveness on various interconnection network topologies, in addition to state-of-the-art deep neural networks for real workload experiments. The results show that MULTITREE achieves 2.3× and 1.56× communication speedup, as well as up to 81% and 30% training time reduction compared to ring all-reduce and state-of-the-art approaches, respectively.
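For readers unfamiliar with the ring all-reduce baseline named in the abstract, the following is a minimal single-process simulation of it. It is not MULTITREE and does not model topology, contention, or the co-designed network interface. The function name `ring_all_reduce` and the simplification of one element per chunk are assumptions made to keep the sketch short.

```python
from typing import List


def ring_all_reduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring all-reduce over n workers arranged in a ring.

    Simplification: each worker's vector has exactly n elements, so every
    chunk is a single element.
    """
    n = len(data)
    if any(len(v) != n for v in data):
        raise ValueError("this sketch expects one element (chunk) per worker")
    buf = [list(v) for v in data]  # copy so the caller's input is untouched

    # Reduce-scatter: after n-1 steps worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages first to model simultaneous sends.
        sends = [((i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] += value

    # All-gather: pass the completed chunks around the ring, overwriting stale values.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] = value

    return buf


grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
reduced = ring_all_reduce(grads)
```

Each worker sends 2(n-1) chunks in total, and the number of sequential steps grows linearly with the number of workers. That step count is one reason tree-based and topology-aware schedules are studied as alternatives.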

all-reduce operation, communication algorithm-architecture co-design, deep learning, (1 more...)

Country: North America > United States > Texas (0.24)

Genre: Research Report (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.99)

DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training

Rigazzi, Alessandro

arXiv.org Machine Learning, Nov-6-2019

Data parallelism has become the de facto standard for training Deep Neural Networks on multiple processing units. In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated Asynchronous Stochastic Gradient Descent (DC-ASGD) algorithm. In our approach, we allow for the overlap of computation and communication, and compensate the inherent error with a first-order correction of the gradients. We prove the effectiveness of our approach by training Convolutional Neural Networks with large batches and achieving state-of-the-art results.

INTRODUCTION

Training Deep Neural Networks (DNNs) is a time- and resource-consuming problem. For example, to train a DNN to state-of-the-art accuracy on a single processing unit, the total time needed is on the order of days, or even weeks [16]. For this reason, in recent years, several algorithms have been developed to allow users to perform parallel or distributed training of DNNs [7].
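The "first-order correction of the gradients" can be illustrated with the delay-compensation term from DC-ASGD, which approximates the Hessian diagonal by the element-wise square of the stale gradient. The sketch below shows only that correction term. The exact DC-S3GD update, including how it combines with decentralized averaging, is not given in the abstract, and the function and parameter names here (`delay_compensated_gradient`, `lam`) are illustrative assumptions.

```python
from typing import List


def delay_compensated_gradient(
    stale_grad: List[float],
    stale_weights: List[float],
    current_weights: List[float],
    lam: float,
) -> List[float]:
    """Apply a DC-ASGD-style first-order correction to a stale gradient.

    corrected = g + lam * g * g * (w_now - w_old), element-wise, where g was
    computed at w_old and the model has since moved to w_now.
    """
    return [
        g + lam * g * g * (w_now - w_old)
        for g, w_old, w_now in zip(stale_grad, stale_weights, current_weights)
    ]


# Toy check on f(w) = 0.5 * w**2 per coordinate, whose gradient is w itself.
w_old = [1.0, -1.0]
w_now = [0.5, -0.75]
g_old = list(w_old)  # gradient evaluated at the stale weights
g_fixed = delay_compensated_gradient(g_old, w_old, w_now, lam=1.0)
```

In this toy case the squared gradient happens to equal the true curvature, so the corrected gradient matches the gradient at the current weights. In general the correction is only an approximation, and `lam` is a tuning parameter.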

algorithm, gradient, iteration, (14 more...)

1911.02516

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)