AITopics | preconditioner

On the Optimizer Dependence of Neural Scaling Laws

Ramani, Vansh, Jain, Shourya Vir

arXiv.org Machine Learning, May-29-2026

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
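To make the "compounds with each model-size doubling" remark concrete, here is a minimal sketch of how a scaling exponent can be fitted and what the two quoted exponents imply per doubling. It assumes noiseless synthetic losses following an exact power law; the prefactor `3.0`, the range of model sizes, and the helper name `fit_scaling_exponent` are illustrative choices, not taken from the paper.

```python
import numpy as np

def fit_scaling_exponent(sizes, losses):
    """Return alpha in L(N) ~ N^(-alpha) via a least-squares line in log-log space."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return -slope

# Synthetic model sizes and losses (assumed, for illustration only).
sizes = 2.0 ** np.arange(6, 14)
loss_gd = 3.0 * sizes ** -0.12   # exponent quoted for gradient descent
loss_ng = 3.0 * sizes ** -0.31   # exponent quoted for the full natural gradient

alpha_gd = fit_scaling_exponent(sizes, loss_gd)
alpha_ng = fit_scaling_exponent(sizes, loss_ng)
ratio = alpha_ng / alpha_gd

# Multiplicative change in loss each time the model size doubles: 2^(-alpha).
per_doubling_gd = 2.0 ** -alpha_gd
per_doubling_ng = 2.0 ** -alpha_ng

print(f"alpha_gd={alpha_gd:.3f}, alpha_ng={alpha_ng:.3f}, ratio={ratio:.2f}")
print(f"loss factor per doubling: GD {per_doubling_gd:.3f}, NG {per_doubling_ng:.3f}")
```

Under these assumptions each doubling multiplies the loss by roughly 0.92 for gradient descent and roughly 0.81 for the natural gradient, so the gap widens geometrically with the number of doublings.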

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2605.29387

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Wang, Mingze, Zhu, Shuchen, Fang, Yuxin, Li, Binghui, Shen, Kai, Zhong, Shu

arXiv.org Machine Learning, May-27-2026

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
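The claim that a scale vector adds no expressivity in front of a linear map, yet changes optimization, can be checked with elementary algebra. The sketch below assumes an RMS-style normalization followed by a single linear map; the shapes, the random values, and the names `rms_normalize`, `g`, `W`, and `delta` are hypothetical. It shows only the basic diagonal rescaling of the update, not the paper's full self-amplifying analysis.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Deterministic normalization: divide each row by its root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=(5, d_in))
g = rng.uniform(0.5, 1.5, size=d_in)   # learnable scale vector (toy values)
W = rng.normal(size=(d_in, d_out))     # subsequent linear mapping

x_hat = rms_normalize(x)

# Expressivity: the scale vector can be folded into the weights.
with_scale = (x_hat * g) @ W
absorbed = x_hat @ (g[:, None] * W)

# Optimization: a gradient step on W moves the effective weight
# W_eff = diag(g) W by a diag(g^2)-rescaled gradient.
delta = rng.normal(size=(5, d_out))    # stand-in for the upstream gradient
lr = 0.1
grad_W = (x_hat * g).T @ delta         # gradient w.r.t. W with the scale vector
grad_W_eff = x_hat.T @ delta           # gradient w.r.t. W_eff without it
step_on_W_eff = g[:, None] * (-lr * grad_W)
preconditioned_step = -lr * (g ** 2)[:, None] * grad_W_eff
```

Both pairs agree numerically: the function class is unchanged, while the update seen by the effective weights is rescaled row by row by `g ** 2`.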

large language model, natural language, scale vector, (16 more...)

arXiv.org Machine Learning

2605.26895

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Nayak, Nikhil, White, Julia, Zaratiana, Urchade, Zhang, Kelton, Princis, Henrijs, Atreja, Dhruv, Fawcett, Henry, Thomas, Matthew, Hurn-Maloney, George, Lewis, Ash

arXiv.org Machine Learning, May-21-2026

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
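The second bias, that the inverse of an unbiased estimate is itself biased, can be illustrated with a scalar toy. For $f(v) = 1/v$ the delta method gives $E[1/\hat v] \approx 1/v + \mathrm{Var}(\hat v)/v^3$, and $\mathrm{Var}(\hat v)$ can be estimated from the spread of microbatch estimates. The simulation below is a sketch under assumed conditions (Gamma-distributed microbatch estimates, `m = 16` microbatches, a plain inverse rather than an inverse root); it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
v_true = 2.0        # population second moment (toy value)
m = 16              # microbatches per optimizer step (assumed)
trials = 200_000    # independent simulated steps

# Each microbatch yields an unbiased but noisy estimate of v_true.
micro = rng.gamma(shape=4.0, scale=v_true / 4.0, size=(trials, m))
v_hat = micro.mean(axis=1)            # unbiased preconditioner estimate
s2 = micro.var(axis=1, ddof=1)        # sample variance across microbatches

naive_inv = 1.0 / v_hat
# Subtract the leading delta-method term, using s2 / m as a plug-in for Var(v_hat).
corrected_inv = naive_inv - (s2 / m) / v_hat ** 3

naive_bias = naive_inv.mean() - 1.0 / v_true
corrected_bias = corrected_inv.mean() - 1.0 / v_true
print(f"naive bias: {naive_bias:.5f}, corrected bias: {corrected_bias:.5f}")
```

In this toy the naive inverse overestimates $1/v$ on average, and the plug-in correction removes most of that offset while reusing only the microbatches already computed for the step.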

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2605.20756

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures

Baldassari, Lorenzo, Garnier, Josselin, Solna, Knut, de Hoop, Maarten V.

arXiv.org Machine Learning, May-19-2026

Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
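The contrast between the two schemes already shows up in the deterministic part of a linear toy problem. The sketch below assumes a single Gaussian target with diagonal covariance spectrum `lam`, a smoothing variance `sigma2`, and a diagonal preconditioner, so that coordinate $k$ has linear drift $-c_k x_k / (\lambda_k + \sigma^2)$. All numerical values are hypothetical, and the noise term is omitted; this is far simpler than the mixture setting analyzed in the paper.

```python
import numpy as np

k = np.arange(1, 9, dtype=float)
lam = k ** -2.0      # assumed target covariance spectrum, decaying with frequency
sigma2 = 1e-3        # assumed annealing (smoothing) variance
h = 0.05             # time step

def linear_rate(precond):
    """Per-coordinate contraction rate c_k / (lam_k + sigma2) of the linear drift."""
    return precond / (lam + sigma2)

def em_factor(rate):
    """One-step multiplier of the mean under forward Euler (Euler-Maruyama)."""
    return 1.0 - h * rate

def exp_factor(rate):
    """One-step multiplier when the linear part is integrated exactly."""
    return np.exp(-h * rate)

identity = np.ones_like(lam)   # no preconditioning
matched = lam + sigma2         # preconditioner with matching spectral decay

em_identity = em_factor(linear_rate(identity))
em_matched = em_factor(linear_rate(matched))
ei_identity = exp_factor(linear_rate(identity))

print("EM, identity preconditioner:", np.round(em_identity, 3))
print("EM, matched preconditioner: ", np.round(em_matched, 3))
print("Exponential integrator:     ", np.round(ei_identity, 3))
```

With the identity preconditioner the forward Euler multiplier exceeds 1 in magnitude on the highest-frequency coordinates, which is the stability constraint $h\,c_k/(\lambda_k + \sigma^2) < 2$ being violated. A preconditioner with matching decay restores stability, and the exact integration of the linear part stays contractive for any step size.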

artificial intelligence, deep learning, machine learning, (20 more...)

arXiv.org Machine Learning

2605.16473

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

Xu, Jin, Couturier, Camille, Rühle, Victor, Rajmohan, Saravan, Hensman, James

arXiv.org Machine Learning, May-11-2026

Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.
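The idea that weight-tied attention is a gradient update on an energy can be illustrated with the familiar log-sum-exp energy from earlier energy-based readings of attention. The sketch below uses that energy as a stand-in: a single head, keys and values tied to the same matrix `K`, and hypothetical values for `beta` and the step size `eta`. The paper's CEM energies and parameterization may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def energy(q, K, beta=1.0):
    """Log-sum-exp interaction energy between state q and the context rows of K."""
    z = beta * (K @ q)
    return -(z.max() + np.log(np.exp(z - z.max()).sum())) / beta

def energy_grad(q, K, beta=1.0):
    """Analytic gradient: minus the attention-weighted average of the rows of K."""
    return -K.T @ softmax(beta * (K @ q))

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))   # six earlier tokens (causal context), width 4
q = rng.normal(size=4)        # state of the current token
eta = 0.1                     # step size (hypothetical)

# One gradient-descent step on the energy is a residual attention update.
q_next = q - eta * energy_grad(q, K)
attention_update = q + eta * K.T @ softmax(K @ q)

# Central finite differences as an independent check of the gradient.
h = 1e-6
fd_grad = np.array([
    (energy(q + h * e, K) - energy(q - h * e, K)) / (2 * h) for e in np.eye(4)
])
```

Here `q_next` and `attention_update` coincide, and the step lowers the energy, which is the sense in which the attention output acts as a descent direction in this toy setting.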

large language model, machine learning, parameterization, (18 more...)

arXiv.org Machine Learning

2605.07588

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

f4eaa4b8f2d08edb3f0af990d56134ea-Paper-Conference.pdf

Neural Information Processing Systems, Apr-30-2026, 07:56:02 GMT

algorithm, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America (0.28)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold

An, Sunyoung, Huo, Xiaoming

arXiv.org Machine Learning, Apr-28-2026

The central limit theorem (CLT) is a foundation of statistical inference: it provides the asymptotic distribution needed for confidence intervals, hypothesis tests, and efficiency comparisons [24, 42]. For iterate-averaged stochastic gradient methods, it specifies both a Gaussian limit and its sandwich covariance in a single theorem statement. This foundation now underpins inference in streaming and online settings -- online A/B testing, continual monitoring of treatment effects, and streaming M-estimation, for example -- where the estimator is updated one observation at a time and inference must be performed in real time. A line of recent work develops online inference procedures for averaged SGD [10, 23, 46]. In practice, one-pass stochastic optimization is routinely combined with adaptive preconditioning, which improves computational efficiency and is believed to sharpen the resulting Gaussian approximation in finite samples. If the CLT fails or the asymptotic variance is altered by the adaptive preconditioning, all downstream inference -- coverage of confidence intervals, size of hypothesis tests, consistency of plug-in covariance estimators -- is compromised. A rigorous understanding of when adaptive preconditioning preserves the CLT is, therefore, a prerequisite for reliable inference in these settings.
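For orientation, the classical Polyak-Ruppert statement that the passage refers to can be written as follows. The notation ($F$, $\ell$, $\xi_t$, $\eta_t$, $H$, $S$, $P_t$) is assumed here for illustration and is not taken from the paper; the last line only states the preconditioned recursion whose limit is in question, not the paper's result.

```latex
% Averaged SGD on F(\theta) = E[\ell(\theta; \xi)] with minimizer \theta^\star
\theta_{t+1} = \theta_t - \eta_t \nabla \ell(\theta_t; \xi_{t+1}),
\qquad
\bar{\theta}_n = \frac{1}{n} \sum_{t=1}^{n} \theta_t .

% Gaussian limit with sandwich covariance (under standard regularity conditions)
\sqrt{n}\,\bigl(\bar{\theta}_n - \theta^\star\bigr)
\;\xrightarrow{d}\;
\mathcal{N}\!\bigl(0,\; H^{-1} S H^{-1}\bigr),
\qquad
H = \nabla^2 F(\theta^\star),
\quad
S = \mathbb{E}\!\left[\nabla \ell(\theta^\star; \xi)\,
                      \nabla \ell(\theta^\star; \xi)^{\top}\right].

% Dynamically preconditioned variant, with a data-dependent matrix P_t
\theta_{t+1} = \theta_t - \eta_t P_t \nabla \ell(\theta_t; \xi_{t+1}).
```

Confidence intervals and tests built from a plug-in estimate of $H^{-1} S H^{-1}$ are valid only if the averaged iterates of the preconditioned recursion keep this same limit.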

artificial intelligence, hypothesis, machine learning, (19 more...)

arXiv.org Machine Learning

2604.23498

Genre: Research Report (0.64)

Technology: