
Collaborating Authors

 Fallen, Nova


Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

arXiv.org Artificial Intelligence

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, this work does not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better with model size than data-parallel training, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
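To make the training pattern the abstract refers to concrete, below is a minimal PyTorch-style sketch of a DiLoCo-like loop: each replica takes many local optimizer steps between synchronizations, and the averaged parameter delta is then applied with an outer momentum update. The function names, default hyperparameter values, inner-optimizer factory, and the plain SGD-with-momentum outer step are illustrative assumptions, not the paper's exact recipe.

```python
import copy
import torch

def diloco_style_train(global_model, replica_data_iters, make_inner_opt,
                       outer_lr=0.7, outer_momentum=0.9,
                       inner_steps=50, rounds=100):
    """Sketch: infrequent synchronization across model replicas.

    Each replica copies the global model, takes `inner_steps` local steps
    with its own optimizer (e.g. AdamW), and only then contributes to a
    synchronized outer update built from the averaged parameter delta.
    """
    outer_buf = [torch.zeros_like(p) for p in global_model.parameters()]
    for _ in range(rounds):
        deltas = [torch.zeros_like(p) for p in global_model.parameters()]
        for data_iter in replica_data_iters:          # one pass per replica
            local = copy.deepcopy(global_model)
            opt = make_inner_opt(local.parameters())  # e.g. torch.optim.AdamW
            for _ in range(inner_steps):              # no cross-replica sync here
                x, y = next(data_iter)
                loss = torch.nn.functional.cross_entropy(local(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
            for d, p_new, p_old in zip(deltas, local.parameters(),
                                       global_model.parameters()):
                d += (p_old.data - p_new.data) / len(replica_data_iters)
        # Outer step: treat the averaged delta as a pseudo-gradient and
        # apply it with momentum (the paper's outer optimizer differs).
        with torch.no_grad():
            for p, buf, d in zip(global_model.parameters(), outer_buf, deltas):
                buf.mul_(outer_momentum).add_(d)
                p.add_(buf, alpha=-outer_lr)
```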


Confidential Federated Computations

arXiv.org Artificial Intelligence

Since its introduction in 2017 [48, 42], federated learning (FL) has seen adoption by technology platforms working with private on-device data (cross-device federated learning) or proprietary server-side data (cross-silo federated learning). FL's appeal has been driven by its straightforward privacy advantages: raw data stays in the control of participating entities, with only focused updates sent for immediate aggregation, visible to the service provider. Systems that realize federated learning [18, 35, 51] run at scale today, reducing privacy risks in sensitive applications like mobile keyboards [33, 63, 21, 53] and voice assistants [12, 34]. However, basic federated learning offers an incomplete privacy story [19]: updates sent to the service provider can reveal private data unless updates are aggregated obliviously, and aggregated updates can encode individual data unless trained with a differentially private (DP) learning algorithm [30]. A dishonest service provider might log or inspect unaggregated messages, from which a great deal of information about an individual participant can be learned [15, 57]. This risk has been addressed with oblivious aggregation schemes that guarantee the service provider cannot inspect unaggregated messages, including secure multiparty computation (SMPC) from cohorts of honest devices [17], non-colluding SMPC-based secure aggregators [58], or hardware trusted execution environments (TEEs) [35].
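As a rough illustration of the two mitigations the abstract names (oblivious aggregation plus a differentially private learning algorithm), here is a minimal sketch of one DP federated-averaging round: each client update is clipped to bound its influence, the updates are summed (standing in for the oblivious aggregation step that a real deployment would perform via SMPC or a TEE so unaggregated updates are never visible), and calibrated Gaussian noise is added before the averaged update is applied. The function, its parameters, and the numpy representation are illustrative assumptions, not the system described in the paper.

```python
import numpy as np

def dp_fedavg_round(global_weights, client_updates,
                    clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Sketch of one differentially private federated-averaging round."""
    rng = rng or np.random.default_rng()

    # Clip each client's update so no single participant dominates the sum.
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / (norm + 1e-12)))

    # In a confidential deployment this sum would be computed obliviously
    # (SMPC or TEE); the server would only ever see the aggregate.
    total = np.sum(clipped, axis=0)

    # Gaussian noise calibrated to the clipping norm gives the DP guarantee.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    avg_update = (total + noise) / len(client_updates)
    return global_weights + avg_update
```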