preconditioner
On the Optimizer Dependence of Neural Scaling Laws
Ramani, Vansh, Jain, Shourya Vir
The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
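As a rough arithmetic illustration of the two fitted exponents quoted above, assume the idealized power law $L(N) = C N^{-\alpha}$ with no irreducible-loss term (an assumption made here for simplicity, not a claim from the abstract). Each doubling of $N$ then multiplies the loss by $2^{-\alpha}$. The helper names below are hypothetical.

```python
import math

# Fitted exponents quoted in the abstract at s ~ 1.0.
ALPHA_GD = 0.12    # gradient descent
ALPHA_NGD = 0.31   # full natural gradient


def loss_ratio_per_doubling(alpha: float) -> float:
    """Factor by which L(N) = C * N**(-alpha) changes when N doubles."""
    return 2.0 ** (-alpha)


def loss_ratio_after_doublings(alpha: float, doublings: int) -> float:
    """Cumulative loss factor after several successive doublings of N."""
    return loss_ratio_per_doubling(alpha) ** doublings


# Ratio of the two exponents (the abstract rounds this to 2.6x).
exponent_ratio = ALPHA_NGD / ALPHA_GD
print(f"exponent ratio: {exponent_ratio:.2f}")

for doublings in (1, 5, 10):
    gd = loss_ratio_after_doublings(ALPHA_GD, doublings)
    ngd = loss_ratio_after_doublings(ALPHA_NGD, doublings)
    print(f"{doublings:2d} doublings: GD x{gd:.3f}, natural gradient x{ngd:.3f}")
```

Under this idealization the gap between the two curves widens with every doubling, which is the sense in which the exponent difference compounds. Constants and any loss floor are ignored, so the numbers say nothing about absolute loss values.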
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
Wang, Mingze, Zhu, Shuchen, Fang, Yuxin, Li, Binghui, Shen, Kai, Zhong, Shu
Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
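The two components of a normalization layer described above can be sketched as follows, assuming RMS normalization (the abstract does not say which normalization operation is used, so this choice is an assumption). The sketch also shows the standard algebraic fact that a scale vector followed by a linear map equals a rescaled weight matrix applied to the normalized input. This is offered only as intuition for the expressivity statement, not as the paper's argument. All function names are hypothetical.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Deterministic part: divide x by its root-mean-square."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]


def norm_layer(x, scale, eps=1e-6):
    """Full layer: normalize, then multiply element-wise by the scale vector."""
    return [g * v for g, v in zip(scale, rms_normalize(x, eps))]


def linear(weight, x):
    """Plain matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fold_scale_into_weight(weight, scale):
    """Return W @ diag(scale): scale column j of W by scale[j]."""
    return [[w * g for w, g in zip(row, scale)] for row in weight]


x = [1.0, -2.0, 3.0, -4.0]
scale = [0.5, 1.0, 1.5, 2.0]
weight = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5]]

with_scale_vector = linear(weight, norm_layer(x, scale))
with_folded_weight = linear(fold_scale_into_weight(weight, scale), rms_normalize(x))
print(with_scale_vector, with_folded_weight)
```

The two outputs agree up to floating-point rounding, so the scale vector adds no new functions when a linear map follows directly. How the parameterization affects training dynamics is a separate question that this sketch does not address.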
Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
Nayak, Nikhil, White, Julia, Zaratiana, Urchade, Zhang, Kelton, Princis, Henrijs, Atreja, Dhruv, Fawcett, Henry, Thomas, Matthew, Hurn-Maloney, George, Lewis, Ash
Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
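The second bias can be illustrated with a scalar toy example. For an unbiased estimate $\hat{m}$ of a positive quantity $\mu$, a second-order Taylor expansion gives $E[1/\hat{m}] \approx 1/\mu + \mathrm{Var}(\hat{m})/\mu^3$, so the naive inverse overestimates $1/\mu$. The sketch below subtracts that leading term using the spread across microbatches. It is a simplified illustration under our own assumptions (scalar preconditioner, gamma-distributed microbatch estimates, hypothetical function names), not the authors' implementation.

```python
import random


def naive_inverse(values):
    """Invert the mean of the microbatch estimates directly."""
    mean = sum(values) / len(values)
    return 1.0 / mean


def variance_corrected_inverse(values):
    """Subtract the leading delta-method bias term Var(mean) / mean**3."""
    k = len(values)
    mean = sum(values) / k
    sample_var = sum((v - mean) ** 2 for v in values) / (k - 1)
    var_of_mean = sample_var / k
    return 1.0 / mean - var_of_mean / mean ** 3


def simulate_biases(trials=40_000, k=8, seed=0):
    """Monte Carlo estimate of the bias of each estimator of 1/mu."""
    rng = random.Random(seed)
    shape, scale = 4.0, 0.5          # arbitrary toy choice; mean = shape * scale
    true_inverse = 1.0 / (shape * scale)
    naive_total = 0.0
    corrected_total = 0.0
    for _ in range(trials):
        values = [rng.gammavariate(shape, scale) for _ in range(k)]
        naive_total += naive_inverse(values)
        corrected_total += variance_corrected_inverse(values)
    return (naive_total / trials - true_inverse,
            corrected_total / trials - true_inverse)


naive_bias, corrected_bias = simulate_biases()
print(f"naive bias: {naive_bias:+.5f}, corrected bias: {corrected_bias:+.5f}")
```

In this toy setting the naive inverse is biased upward and the corrected version removes most of that bias. Matrix preconditioners and inverse roots need the corresponding matrix-valued expansion, which is not shown here.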
Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures
Baldassari, Lorenzo, Garnier, Josselin, Solna, Knut, de Hoop, Maarten V.
Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
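The difference between the two schemes can be seen in a scalar toy problem, given here as an illustration under our own assumptions rather than the paper's setting. For the linear drift $\dot{x} = -\lambda x$, a forward Euler step of size $h$ multiplies $x$ by $1 - h\lambda$, which contracts only when $h\lambda < 2$. Integrating the linear part exactly multiplies $x$ by $e^{-h\lambda}$, which contracts for every $h > 0$. The noise term is omitted, and the function names and sample rates are hypothetical.

```python
import math


def euler_factor(rate: float, step: float) -> float:
    """Per-step multiplier of forward Euler applied to dx/dt = -rate * x."""
    return 1.0 - step * rate


def exponential_factor(rate: float, step: float) -> float:
    """Per-step multiplier when the linear part is integrated exactly."""
    return math.exp(-step * rate)


def iterate(factor: float, x0: float = 1.0, n_steps: int = 50) -> float:
    """Apply the same multiplier n_steps times, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x *= factor
    return x


step = 0.1
for rate in (1.0, 10.0, 30.0):   # larger rate = stiffer (higher-frequency) coordinate
    em = iterate(euler_factor(rate, step))
    ei = iterate(exponential_factor(rate, step))
    print(f"rate={rate:5.1f}  Euler: {em: .3e}  exponential: {ei: .3e}")
```

With a fixed step, the Euler iterate blows up once the rate exceeds $2/h$, while the exact linear update decays at every rate. A preconditioner with spectral decay lowers the effective rates of the stiff coordinates, which is one way to read the coupling between step size, preconditioner, and covariance scale described above.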
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
Xu, Jin, Couturier, Camille, Rühle, Victor, Rajmohan, Saravan, Hensman, James
Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
The central limit theorem (CLT) is a foundation of statistical inference: it provides the asymptotic distribution needed for confidence intervals, hypothesis tests, and efficiency comparisons [24, 42]. For iterate-averaged stochastic gradient methods, it specifies both a Gaussian limit and its sandwich covariance in a single theorem statement. This foundation now underpins inference in streaming and online settings -- online A/B testing, continual monitoring of treatment effects, and streaming M-estimation, for example -- where the estimator is updated one observation at a time and inference must be performed in real time. A line of recent work develops online inference procedures for averaged SGD [10, 23, 46]. In practice, one-pass stochastic optimization is routinely combined with adaptive preconditioning, which improves computational efficiency and is believed to sharpen the resulting Gaussian approximation in finite samples. If the CLT fails or the asymptotic variance is altered by the adaptive preconditioning, all downstream inference -- coverage of confidence intervals, size of hypothesis tests, consistency of plug-in covariance estimators -- is compromised. A rigorous understanding of when adaptive preconditioning preserves the CLT is, therefore, a prerequisite for reliable inference in these settings.