paquette
Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
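As a rough illustration of the setting studied here (a minimal sketch, not the paper's HSGD construction; the dimensions, step size, and seed below are arbitrary choices), multi-pass SGD and full-batch GD can be compared on a random least-squares quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d)) / np.sqrt(d)   # random data matrix
x_star = rng.standard_normal(d)                # planted signal, so b is noiseless
b = A @ x_star

def risk(x):
    # Empirical least-squares risk: the convex quadratic being minimized.
    return 0.5 * np.mean((A @ x - b) ** 2)

def sgd(steps, lr=0.3):
    # Multi-pass SGD: one uniformly sampled row per step.
    x = np.zeros(d)
    out = []
    for _ in range(steps):
        i = rng.integers(n)
        x -= lr * (A[i] @ x - b[i]) * A[i]
        out.append(risk(x))
    return out

def gd(steps, lr=0.3):
    # Full-batch gradient descent on the same quadratic.
    x = np.zeros(d)
    out = []
    for _ in range(steps):
        x -= lr * A.T @ (A @ x - b) / n
        out.append(risk(x))
    return out

sgd_risks, gd_risks = sgd(3000), gd(3000)
```

Note that each SGD step here touches a single row while each GD step touches all n rows, which is the per-iteration cost gap that the implicit-conditioning analysis weighs against the convergence rates.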
High-Dimensional Privacy-Utility Dynamics of Noisy Stochastic Gradient Descent on Least Squares
Lin, Shurong, Kolaczyk, Eric D., Smith, Adam, Paquette, Elliot
The interplay between optimization and privacy has become a central theme in privacy-preserving machine learning. Noisy stochastic gradient descent (SGD) has emerged as a cornerstone algorithm, particularly in large-scale settings. These variants of gradient methods inject carefully calibrated noise into each update to achieve differential privacy, the gold standard notion of rigorous privacy guarantees. Prior work primarily provides various bounds on statistical risk and privacy loss for noisy SGD, yet the \textit{exact} behavior of the process remains unclear, particularly in high-dimensional settings. This work leverages a diffusion approach to analyze noisy SGD precisely, providing a continuous-time perspective that captures both statistical risk evolution and privacy loss dynamics in high dimensions. Moreover, we study a variant of noisy SGD that does not require explicit knowledge of gradient sensitivity, unlike existing work that assumes or enforces sensitivity through gradient clipping. Specifically, we focus on the least squares problem with $\ell_2$ regularization.
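A minimal sketch of the kind of process studied (hypothetical dimensions and hyperparameters; the injected noise scale sigma is illustrative, not calibrated to any differential privacy budget):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 0.1                                      # ell_2 (ridge) regularization strength

def noisy_sgd(steps, lr=0.2, sigma=0.05):
    # Noisy SGD on ridge-regularized least squares: isotropic Gaussian
    # noise (scale sigma, illustrative only) is injected into each update.
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        grad = (A[i] @ x - b[i]) * A[i] + lam * x
        x -= lr * (grad + sigma * rng.standard_normal(d))
    return x

x_noisy = noisy_sgd(5000)
# Exact ridge solution, for reference.
x_ridge = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
```

The regularization keeps the iterates in a neighborhood of the ridge solution, so statistical risk and cumulative privacy loss can both be tracked along the trajectory.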
4+3 Phases of Compute-Optimal Neural Scaling Laws
We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model parameter count. We use this model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves that holds for all iteration counts and improves in accuracy as the model parameter count grows. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features.
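A toy version of such a solvable model (power-law feature spectrum and target; the exponents alpha, beta and the dimension are hypothetical choices, not the paper's parametrization) trained with one-pass SGD on a mean-squared loss:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 400                                             # model parameter count
alpha, beta = 1.0, 1.5                              # hypothetical complexity exponents
lam = np.arange(1, d + 1, dtype=float) ** -alpha    # power-law feature spectrum
target = np.arange(1, d + 1, dtype=float) ** -beta  # power-law target coefficients
sqrt_lam = np.sqrt(lam)

def one_pass_sgd(steps, lr=0.1):
    # One-pass SGD on a mean-squared loss: every step draws a fresh sample,
    # so the iteration count doubles as the data budget.
    w = np.zeros(d)
    losses = []
    for _ in range(steps):
        x = sqrt_lam * rng.standard_normal(d)       # feature vector, covariance diag(lam)
        err = x @ (w - target)
        w -= lr * err * x
        losses.append(0.5 * np.sum(lam * (w - target) ** 2))  # population risk
    return losses

losses = one_pass_sgd(3000)
```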
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover, we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues.
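An illustrative AdaGrad-Norm run on least squares (a sketch under assumed dimensions and hyperparameters eta, b0; not the paper's exact setup), showing the single scalar learning rate adapting as gradients accumulate:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 30
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ rng.standard_normal(d)

def adagrad_norm(steps, eta=1.0, b0=1.0):
    # AdaGrad-Norm: one scalar learning rate eta / sqrt(b0 + sum of ||g||^2),
    # shared across all coordinates and nonincreasing over time.
    x = np.zeros(d)
    g2 = b0
    lrs = []
    for _ in range(steps):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]
        g2 += g @ g
        lrs.append(eta / np.sqrt(g2))
        x -= lrs[-1] * g
    return x, lrs

x, lrs = adagrad_norm(4000)
final_risk = 0.5 * np.mean((A @ x - b) ** 2)
```

Because the accumulated gradient norm g2 only grows, the learning rate curve is monotone, and its limiting value is one of the quantities the ODE framework characterizes exactly.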
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Atanasov, Alexander, Bordelon, Blake, Zavatone-Veth, Jacob A., Paquette, Courtney, Pehlevan, Cengiz
Modern deep learning practice is governed by the surprising predictability of performance improvement with increases in the scale of data, model size, and compute [17]. Often, the scaling of performance as a function of these quantities exhibits remarkably regular power law behavior, termed a neural scaling law [2, 6, 12, 13, 15, 16, 18, 19, 22, 32]. Here, performance is usually measured by some differentiable loss on the predictions of the model on a held-out test set representative of the population. Given the relatively universal behavior of the exponents across architectures and optimizers [11, 18, 19], one might hope that relatively simple models of information processing systems might be able to recover the same types of scaling laws. The (stochastic) gradient descent (SGD) dynamics of random feature models were analyzed in recent works [7, 20, 26]; this model exhibits a surprising breadth of scaling behavior and captures several interesting phenomena in deep network training. Each of the above works has isolated various effects that can hurt performance compared to the idealized infinite data and infinite model size limits. The model was first studied in [7], where the bottlenecks due to finite width and finite dataset size were computed and, for certain data structures, resulted in a Chinchilla-type scaling result as in [18].
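A minimal random-feature linear model of the kind referenced (hypothetical sizes D, N and spectrum exponent alpha), trained by one-pass SGD; the finite model size N creates the capacity bottleneck discussed above:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 800, 100                                   # data dimension, model size
alpha = 1.2
lam = np.arange(1, D + 1, dtype=float) ** -alpha  # power-law data covariance
w_star = rng.standard_normal(D) * np.sqrt(lam)    # target weights
F = rng.standard_normal((N, D)) / np.sqrt(D)      # fixed random feature embedding
sqrt_lam = np.sqrt(lam)

def sgd_random_features(steps, lr=0.4):
    # One-pass SGD on the random-feature model: the learnable weights theta
    # act only on the N-dimensional projection F x of the D-dimensional input.
    theta = np.zeros(N)
    losses = []
    for _ in range(steps):
        x = sqrt_lam * rng.standard_normal(D)
        feat = F @ x
        err = theta @ feat - w_star @ x
        theta -= lr * err * feat
        losses.append(err ** 2)   # squared error on a fresh sample
    return losses

losses = sgd_random_features(4000)
early, late = np.mean(losses[:1000]), np.mean(losses[-1000:])
```

Since N < D, the model cannot represent the full target, so the loss plateaus at an approximation floor set by the unembedded tail of the spectrum rather than decaying to zero.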
Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
We analyze the dynamics of large-batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measure, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum while using only a fraction of the samples per update). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate.
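A sketch of mini-batch SGD with momentum on least squares (the batch sizes and hyperparameters below are illustrative and are not tied to the ICR threshold of the analysis):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 400, 100
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ rng.standard_normal(d)

def sgd_momentum(steps, batch, lr=0.1, beta=0.9):
    # Mini-batch SGD with heavy-ball momentum on least squares.
    x, v = np.zeros(d), np.zeros(d)
    risks = []
    for _ in range(steps):
        idx = rng.integers(n, size=batch)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch
        v = beta * v - lr * g
        x += v
        risks.append(0.5 * np.mean((A @ x - b) ** 2))
    return risks

small_batch = sgd_momentum(1500, batch=4)
large_batch = sgd_momentum(1500, batch=128)
```

In the regime the paper describes, increasing the batch size improves the rate only up to the ICR, after which further enlargement saturates.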
Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
Xiao, Ke Liang, Marshall, Noah, Agarwala, Atish, Paquette, Elliot
The success of deep learning has been driven by the effectiveness of relatively simple stochastic optimization algorithms. Stochastic gradient descent (SGD) with momentum can be used to train models like ResNet50 with minimal hyperparameter tuning. The workhorse of modern machine learning is Adam, which was designed to approximate preconditioning with a diagonal, online estimate of the Fisher information matrix (Kingma, 2014). Additional hypotheses for the success of Adam include its ability to maintain balanced updates to parameters across layers and its potential noise-mitigating effects (Zhang et al., 2020; 2024). Getting a quantitative, theoretical understanding of Adam and its variants is hindered by their complexity. While the multiple exponential moving averages are easy to implement, they complicate analysis. The practical desire for simpler, more efficient learning algorithms as well as the theoretical desire for simpler models to analyze have led to a resurgence in the study of signSGD.
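A minimal signSGD loop on least squares (illustrative sizes and step size), showing the elementwise sign update that replaces the raw gradient and compresses away its magnitude:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 300, 40
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ rng.standard_normal(d)

def sign_sgd(steps, lr=0.05, batch=8):
    # signSGD: step in the direction of the elementwise sign of the
    # mini-batch gradient; only the sign pattern survives, which acts
    # as a crude diagonal preconditioner and a noise compressor.
    x = np.zeros(d)
    risks = []
    for _ in range(steps):
        idx = rng.integers(n, size=batch)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch
        x -= lr * np.sign(g)
        risks.append(0.5 * np.mean((A @ x - b) ** 2))
    return risks

risks = sign_sgd(2000)
```

With a fixed step size the iterates plateau in a neighborhood of the optimum rather than converging exactly, since the update magnitude never shrinks; quantifying that floor is part of what exact risk curves make precise.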
Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem. Our framework, inspired by random matrix theory, provides an exact (deterministic) characterization for the sequence of function values produced by these algorithms which is expressed only in terms of the eigenvalues of the Hessian. This leads to simple expressions for nearly-optimal hyperparameters, a description of the limiting neighborhood, and average-case complexity. As a consequence, we show that (small-batch) stochastic heavy-ball momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. In contrast, in the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
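A sketch comparing plain SGD with stochastic heavy ball at a matched effective step size lr / (1 - beta), in the spirit of the "step sizes adjusted correctly" comparison (all sizes and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 400, 80
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ rng.standard_normal(d)

def run(steps, lr, beta):
    # beta = 0 recovers plain SGD; beta > 0 is stochastic heavy ball.
    x, v = np.zeros(d), np.zeros(d)
    risks = []
    for _ in range(steps):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]
        v = beta * v - lr * g
        x += v
        risks.append(0.5 * np.mean((A @ x - b) ** 2))
    return risks

beta = 0.5
sgd_risks = run(4000, lr=0.4, beta=0.0)
shb_risks = run(4000, lr=0.4 * (1 - beta), beta=beta)  # matched effective step size
```

With the step sizes matched this way, the two risk trajectories are qualitatively similar, consistent with the claim that fixed-parameter stochastic heavy ball yields no genuine acceleration in this setting.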