AITopics | Charles, Zachary

Collaborating Authors

Charles, Zachary

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Charles, Zachary, Teston, Gabriel, Dery, Lucio, Rush, Keith, Fallen, Nova, Garrett, Zachary, Szlam, Arthur, Douillard, Arthur

arXiv.org Artificial IntelligenceMar-12-2025

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.09799

Country: Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Scaling Laws for Differentially Private Language Models

McKenna, Ryan, Huang, Yangsibo, Sinha, Amer, Balle, Borja, Charles, Zachary, Choquette-Choo, Christopher A., Ghazi, Badih, Kaissis, George, Kumar, Ravi, Liu, Ruibo, Yu, Da, Zhang, Chiyuan

arXiv.org Artificial IntelligenceJan-31-2025

Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.18914

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Douillard, Arthur, Donchev, Yanislav, Rush, Keith, Kale, Satyen, Charles, Zachary, Garrett, Zachary, Teston, Gabriel, Lacey, Dave, McIlroy, Ross, Shen, Jiajun, Ramé, Alexandre, Szlam, Arthur, Ranzato, Marc'Aurelio, Barham, Paul

arXiv.org Artificial IntelligenceJan-30-2025

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into ``workers'', where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.18512

Country: Asia > Middle East (0.46)

Genre: Research Report (0.85)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fine-Tuning Large Language Models with User-Level Differential Privacy

Charles, Zachary, Ganesh, Arun, McKenna, Ryan, McMahan, H. Brendan, Mitchell, Nicole, Pillutla, Krishna, Rush, Keith

arXiv.org Artificial IntelligenceJul-10-2024

We investigate practical and scalable algorithms for training large language models (LLMs) with user-level differential privacy (DP) in order to provably safeguard all the examples contributed by each user. We study two variants of DP-SGD with: (1) example-level sampling (ELS) and per-example gradient clipping, and (2) user-level sampling (ULS) and per-user gradient clipping. We derive a novel user-level DP accountant that allows us to compute provably tight privacy guarantees for ELS. Using this, we show that while ELS can outperform ULS in specific settings, ULS generally yields better results when each user has a diverse collection of examples. We validate our findings through experiments in synthetic mean estimation and LLM fine-tuning tasks under fixed compute budgets. We find that ULS is significantly better in settings where either (1) strong privacy guarantees are required, or (2) the compute budget is large. Notably, our focus on LLM-compatible training algorithms allows us to scale to models with hundreds of millions of parameters and datasets with hundreds of thousands of users.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2407.07737

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning

Charles, Zachary, Mitchell, Nicole, Pillutla, Krishna, Reneer, Michael, Garrett, Zachary

arXiv.org Artificial IntelligenceDec-21-2023

We introduce Dataset Grouper, a library to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library facilitates the creation of group-structured versions of existing datasets based on user-specified partitions and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper enables large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work, allowing for federated training of language models with hundreds of millions, and even billions, of parameters. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation. Dataset Grouper is available at https://github.com/google-research/dataset_grouper.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2307.09619

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Leveraging Function Space Aggregation for Federated Learning at Scale

Dhawan, Nikita, Mitchell, Nicole, Charles, Zachary, Garrett, Zachary, Dziugaite, Gintare Karolina

arXiv.org Artificial IntelligenceNov-16-2023

The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model, without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, Fed-Fish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.

artificial intelligence, fedfish, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2311.10291

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.64)

Industry: Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy

Koloskova, Anastasia, McKenna, Ryan, Charles, Zachary, Rush, Keith, McMahan, Brendan

arXiv.org Artificial IntelligenceJun-23-2023

We study gradient descent under linearly correlated noise. Our work is motivated by recent practical methods for optimization with differential privacy (DP), such as DP-FTRL, which achieve strong performance in settings where privacy amplification techniques are infeasible (such as in federated learning). These methods inject privacy noise through a matrix factorization mechanism, making the noise linearly correlated over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise. We analyze the behavior of gradient descent in this setting, for both convex and non-convex functions. Our analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent). We use our results to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2302.01463

Country:

Asia (0.14)
Europe (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Federated Automatic Differentiation

Rush, Keith, Charles, Zachary, Garrett, Zachary

arXiv.org Artificial IntelligenceJan-18-2023

Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (ie. entirely at each client, or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2301.07806

Country: North America > United States (0.67)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.92)

Add feedback

Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory

Mitchell, Nicole, Ballé, Johannes, Charles, Zachary, Konečný, Jakub

arXiv.org Machine LearningJan-7-2022

A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. We propose a method to reduce this cost. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds, and within each setting, distortion reliably predicts model performance. This allows for a remarkably simple compression scheme that is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on the Stack Overflow next-word prediction benchmark.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2201.02664

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Local Adaptivity in Federated Learning: Convergence and Consistency

Wang, Jianyu, Xu, Zheng, Garrett, Zachary, Charles, Zachary, Liu, Luyang, Joshi, Gauri

arXiv.org Machine LearningJun-4-2021

The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient descent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show in both theory and practice that while local adaptive methods can accelerate convergence, they can cause a non-vanishing solution bias, where the final converged solution may be different from the stationary point of the global objective function. We propose correction techniques to overcome this inconsistency and complement the local adaptive methods for FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms can achieve faster convergence and higher test accuracy than the baselines without local adaptivity.

artificial intelligence, optimization problem, optimizer, (16 more...)

arXiv.org Machine Learning

2106.02305

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback