AITopics | Rush, Keith

Collaborating Authors

Rush, Keith

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Charles, Zachary, Teston, Gabriel, Dery, Lucio, Rush, Keith, Fallen, Nova, Garrett, Zachary, Szlam, Arthur, Douillard, Arthur

arXiv.org Artificial IntelligenceMar-12-2025

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.09799

Country: Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Douillard, Arthur, Donchev, Yanislav, Rush, Keith, Kale, Satyen, Charles, Zachary, Garrett, Zachary, Teston, Gabriel, Lacey, Dave, McIlroy, Ross, Shen, Jiajun, Ramé, Alexandre, Szlam, Arthur, Ranzato, Marc'Aurelio, Barham, Paul

arXiv.org Artificial IntelligenceJan-30-2025

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into ``workers'', where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.18512

Country: Asia > Middle East (0.46)

Genre: Research Report (0.85)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fine-Tuning Large Language Models with User-Level Differential Privacy

Charles, Zachary, Ganesh, Arun, McKenna, Ryan, McMahan, H. Brendan, Mitchell, Nicole, Pillutla, Krishna, Rush, Keith

arXiv.org Artificial IntelligenceJul-10-2024

We investigate practical and scalable algorithms for training large language models (LLMs) with user-level differential privacy (DP) in order to provably safeguard all the examples contributed by each user. We study two variants of DP-SGD with: (1) example-level sampling (ELS) and per-example gradient clipping, and (2) user-level sampling (ULS) and per-user gradient clipping. We derive a novel user-level DP accountant that allows us to compute provably tight privacy guarantees for ELS. Using this, we show that while ELS can outperform ULS in specific settings, ULS generally yields better results when each user has a diverse collection of examples. We validate our findings through experiments in synthetic mean estimation and LLM fine-tuning tasks under fixed compute budgets. We find that ULS is significantly better in settings where either (1) strong privacy guarantees are required, or (2) the compute budget is large. Notably, our focus on LLM-compatible training algorithms allows us to scale to models with hundreds of millions of parameters and datasets with hundreds of thousands of users.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2407.07737

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Cascade-Aware Training of Language Models

Wang, Congchao, Augenstein, Sean, Rush, Keith, Jitkrittum, Wittawat, Narasimhan, Harikrishna, Rawat, Ankit Singh, Menon, Aditya Krishna, Go, Alec

arXiv.org Artificial IntelligenceMay-29-2024

Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employ smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the cascaded LMs during training. In this paper, we present cascade-aware training(CAT), an approach to optimizing the overall quality-cost performance tradeoff of a cascade of LMs. We achieve inference-time benefits by training the small LM with awareness of its place in a cascade and downstream capabilities. We demonstrate the value of the proposed method with over 60 LM tasks of the SuperGLUE, WMT22, and FLAN2021 datasets.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2406.0006

Country:

North America > United States > Texas (0.14)
North America > United States > California (0.14)
North America > United States > Arizona (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Add feedback

(Amplified) Banded Matrix Factorization: A unified approach to private training

Choquette-Choo, Christopher A., Ganesh, Arun, McKenna, Ryan, McMahan, H. Brendan, Rush, Keith, Thakurta, Abhradeep, Xu, Zheng

arXiv.org Artificial IntelligenceNov-1-2023

Matrix factorization (MF) mechanisms for differential privacy (DP) have substantially improved the state-of-the-art in privacy-utility-computation tradeoffs for ML applications in a variety of scenarios, but in both the centralized and federated settings there remain instances where either MF cannot be easily applied, or other algorithms provide better tradeoffs (typically, as $\epsilon$ becomes small). In this work, we show how MF can subsume prior state-of-the-art algorithms in both federated and centralized training settings, across all privacy budgets. The key technique throughout is the construction of MF mechanisms with banded matrices (lower-triangular matrices with at most $\hat{b}$ nonzero bands including the main diagonal). For cross-device federated learning (FL), this enables multiple-participations with a relaxed device participation schema compatible with practical FL infrastructure (as demonstrated by a production deployment). In the centralized setting, we prove that banded matrices enjoy the same privacy amplification results as the ubiquitous DP-SGD algorithm, but can provide strictly better performance in most scenarios -- this lets us always at least match DP-SGD, and often outperform it.

artificial intelligence, banded matrix factorization, machine learning, (3 more...)

arXiv.org Artificial Intelligence

2306.08153

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy

Koloskova, Anastasia, McKenna, Ryan, Charles, Zachary, Rush, Keith, McMahan, Brendan

arXiv.org Artificial IntelligenceJun-23-2023

We study gradient descent under linearly correlated noise. Our work is motivated by recent practical methods for optimization with differential privacy (DP), such as DP-FTRL, which achieve strong performance in settings where privacy amplification techniques are infeasible (such as in federated learning). These methods inject privacy noise through a matrix factorization mechanism, making the noise linearly correlated over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise. We analyze the behavior of gradient descent in this setting, for both convex and non-convex functions. Our analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent). We use our results to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2302.01463

Country:

Asia (0.14)
Europe (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning

Choquette-Choo, Christopher A., McMahan, H. Brendan, Rush, Keith, Thakurta, Abhradeep

arXiv.org Artificial IntelligenceJun-8-2023

We introduce new differentially private (DP) mechanisms for gradient-based machine learning (ML) with multiple passes (epochs) over a dataset, substantially improving the achievable privacy-utility-computation tradeoffs. We formalize the problem of DP mechanisms for adaptive streams with multiple participations and introduce a non-trivial extension of online matrix factorization DP mechanisms to our setting. This includes establishing the necessary theory for sensitivity calculations and efficient computation of optimal matrices. For some applications like $>\!\! 10,000$ SGD steps, applying these optimal techniques becomes computationally expensive. We thus design an efficient Fourier-transform-based mechanism with only a minor utility loss. Extensive empirical evaluation on both example-level DP for image classification and user-level DP for language modeling demonstrate substantial improvements over all previous methods, including the widely-used DP-SGD . Though our primary application is to ML, our main DP results are applicable to arbitrary linear queries and hence may have much broader applicability.

data quality, machine learning, mechanism, (15 more...)

arXiv.org Artificial Intelligence

2211.0653

Country:

North America > United States > Texas (0.14)
North America > United States > Hawaii (0.14)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Quality > Data Transformation (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Federated Automatic Differentiation

Rush, Keith, Charles, Zachary, Garrett, Zachary

arXiv.org Artificial IntelligenceJan-18-2023

Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (ie. entirely at each client, or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2301.07806

Country: North America > United States (0.67)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.92)

Add feedback

Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams

Denisov, Sergey, McMahan, Brendan, Rush, Keith, Smith, Adam, Thakurta, Abhradeep Guha

arXiv.org Artificial IntelligenceJan-17-2023

Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.

artificial intelligence, machine learning, mechanism, (17 more...)

arXiv.org Artificial Intelligence

2202.08312

Country: North America > United States (1.00)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Dimension Independence in Unconstrained Private ERM via Adaptive Preconditioning

Kairouz, Peter, Ribero, Mónica, Rush, Keith, Thakurta, Abhradeep

arXiv.org Machine LearningAug-14-2020

In this paper we revisit the problem of private empirical risk minimziation (ERM) with differential privacy. We show that for unconstrained convex empirical risk minimization if the observed gradients of the objective function along the path of private gradient descent lie in a low-dimensional subspace (smaller than the ambient dimensionality of $p$), then using noisy adaptive preconditioning (a.k.a., noisy Adaptive Gradient Descent (AdaGrad)) we obtain a regret composed of two terms: a constant multiplicative factor of the original AdaGrad regret and an additional regret due to noise. In particular, we show that if the gradients lie in a constant rank subspace, then one can achieve an excess empirical risk of $ \tilde{O}(1/\epsilon n)$, compared to the worst-case achievable bound of $\tilde{O}(\sqrt{p}/\epsilon n)$. While previous works show dimension independent excess empirical risk bounds for the restrictive setting of convex generalized linear problems optimized over unconstrained subspaces, our results operate with general convex functions in unconstrained minimization. Along the way, we do a perturbation analysis of noisy AdaGrad, which may be of independent interest.

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Machine Learning

2008.0657

Country:

North America > United States > Texas (0.14)
North America > Canada > Quebec (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback