AITopics | low-communication training

Collaborating Authors

low-communication training

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

What happens when nanochat meets DiLoCo?

Acker, Alexander, Becker, Soeren, Nedelkoski, Sasho, Scheinert, Dominik, Kao, Odej, Wiesner, Philipp

arXiv.org Artificial IntelligenceNov-19-2025

Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.13761

Country: Europe > Germany > Berlin (0.04)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

Distributed Low-Communication Training with Decoupled Momentum Optimization

Nedelkoski, Sasho, Acker, Alexander, Kao, Odej, Becker, Soeren, Scheinert, Dominik

arXiv.org Artificial IntelligenceOct-7-2025

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.03371

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.68)

Industry: Information Technology (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

Jaghouar, Sami, Ong, Jack Min, Hagemann, Johannes

arXiv.org Artificial IntelligenceJul-10-2024

OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90-95% compute utilization. Additionally, we conduct ablations studies focusing on the algorithm's compute efficiency, scalability in the number of workers and show that its gradients can be all-reduced using FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion parameter models.

low-communication training, open-source framework, opendiloco

arXiv.org Artificial Intelligence

2407.07852

Genre: Research Report (0.69)

Technology:

Information Technology > Software (0.60)
Information Technology > Artificial Intelligence > Natural Language (0.53)

Add feedback

DiLoCo: Distributed Low-Communication Training of Language Models

Douillard, Arthur, Feng, Qixuan, Rusu, Andrei A., Chhaparia, Rachita, Donchev, Yani, Kuncoro, Adhiguna, Ranzato, Marc'Aurelio, Szlam, Arthur, Shen, Jiajun

arXiv.org Artificial IntelligenceDec-2-2023

Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and vice versa, it can seamlessly leverage resources that become available during training.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2311.08105

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback