Goto

Collaborating Authors

 exponent


On the Optimizer Dependence of Neural Scaling Laws

arXiv.org Machine Learning

The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.


How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

arXiv.org Machine Learning

Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = σ^*(\langle θ^*, x\rangle)$ and $x \sim N(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $θ^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $β_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/β_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/β_2}$. Keeping the $β_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $β_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.


Asymmetric Scaling Laws from Sparse Features

arXiv.org Machine Learning

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold -- where the number of parameters is just sufficient to fit the training data -- resulting in a loss curve governed by two distinct scaling exponents -- one for the overparameterized regime and one for the underparameterized regime -- with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.


Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

arXiv.org Machine Learning

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.


Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

arXiv.org Machine Learning

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.


Intrinsic Wasserstein Rates for Score-Based Generative Models on Smooth Manifolds

arXiv.org Machine Learning

Score-based generative models are trained in high-dimensional ambient spaces, yet many data distributions are supported on low-dimensional nonlinear structures. We prove that, for compact $d$-dimensional smooth manifolds $\mathcal{M} \subset [0,1]^D$ with $d > 2$ and $β$-Hölder densities strictly positive on $\mathcal{M}$, a variance-preserving SGM estimator attains the intrinsic Wasserstein--1 sample exponent $\tilde{\mathcal{O}}(D^{\mathcal{O}_β(d)}n^{-(β+1)/(d+2β)})$, up to logarithmic factors and explicit geometry and density factors. The full nonasymptotic bound explicitly isolates the finite-order geometry envelope, Hölder radius, density lower bound, ambient dependence, and finite-order correction terms. The analysis separates score approximation into a large-noise tangent-cell regime and a small-noise projection-centered, de-Gaussianized Laplace regime. The key technical ingredient is a ReLU implementation of nearest-projection coordinates via finite intrinsic anchors and Gauss--Newton iterations, rather than approximating the manifold projection as a black-box high-dimensional smooth map. Consequently, for families with polynomially controlled geometry and density lower bounds, the constructed score-network parameters have polynomial ambient dependence.


How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arXiv.org Machine Learning

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $μ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.


A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

arXiv.org Machine Learning

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.


Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

arXiv.org Machine Learning

We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $ n^{1/(2β)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $β>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=Θ(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.


Regret Analysis of Guided Diffusion for Black-Box Optimization over Structured Inputs

arXiv.org Machine Learning

Guided-diffusion black-box optimization (BO) has shown strong empirical performance on structured design problems such as molecules and crystals, but its regret behavior remains poorly understood. Existing BO regret analyses typically rely on maximum information gain, non-pretrained surrogate models, or exact acquisition maximization -- assumptions that break down in modern diffusion -- BO pipelines, where pretrained diffusion models serve as powerful priors over valid structures and acquisition maximization is replaced by approximate sampling over astronomically large discrete spaces. We develop a first certificate-based expected simple-regret framework for guided-diffusion BO that avoids maximum-information-gain bounds, RKHS assumptions, and exact acquisition maximization. The central quantity in our analysis is mass lift: the increase in probability mass assigned to near-optimal designs relative to the pretrained generator. This view explains how exponential-looking finite-budget convergence and polynomial acceleration can all arise from the same mechanism. We also give practical diagnostics for estimating search exponents from finite candidate pools and a proposal-corrected resampling construction that provides a fully certified sampler instance.