Goto

Collaborating Authors

 theorem4


The Origin of Edge of Stability

Litman, Elon

arXiv.org Machine Learning

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/η$, where $η$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/η$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/η$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/η$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.


Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

Zhang, Huiming, Li, Binghan, Tian, Wan, Sun, Qiang

arXiv.org Machine Learning

Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter $θ$ controls the tail heaviness: $θ=2$ corresponds to sub-Gaussian, $θ=1$ to sub-exponential, and $0<θ<1$ to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log $f_θ$-divergence, which admits explicit comparisons to Rényi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index $θ$, with complexity scaling as $\log^{1/θ}$ and entropy$^{1/θ}$. These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale Rényi mutual information. We illustrate the consequences in Rényi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.


Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs

Ding, Ziyi, Zhang, Xiao-Ping

arXiv.org Machine Learning

The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it simultaneously resolves all incoming edges of a node, leaving a gap of "inconclusive" causal effects (15-23% in moderate graphs). We introduce Iterative Identification Closure (IIC), a general framework that decouples causal identification into two phases: (1) a seed function S_0 that identifies an initial set of edges from any external source of information (instrumental variables, interventions, non-Gaussianity, prior knowledge, etc.); and (2) Reduced HTC propagation that iteratively substitutes known coefficients to reduce system dimension, enabling identification of edges that standard HTC cannot resolve. The core novelty is iterative identification propagation: newly identified edges feed back to unlock further identification -- a mechanism absent from all existing graphical criteria, which treat each edge (or node) in isolation. This propagation is non-trivial: coefficient substitution alters the covariance structure, and soundness requires proving that the modified Jacobian retains generic full rank -- a new theoretical result (Reduced HTC Theorem). We prove that IIC is sound, monotone, converges in O(|E|) iterations (empirically <=2), and strictly subsumes both HTC and ancestor decomposition. Exhaustive verification on all graphs with n<=5 (134,144 edges) confirms 100% precision (zero false positives); with combined seeds, IIC reduces the HTC gap by over 80%. The propagation gain is gamma~4x (2 seeds identifying ~3% of edges to 97.5% total identification), far exceeding gamma<=1.2x of prior methods that incorporate side information without iterative feedback.


Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization

Yaakoubi, Chiheb, Louart, Cosme, Tiomoko, Malik, Liao, Zhenyu

arXiv.org Machine Learning

We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $μ_{\hatθ}$ and covariance $C_{\hatθ}$ of the ERM estimator $\hatθ$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hatθ^\top x$ approximately follows the convolution of the (generally non-Gaussian) distribution of $μ_{\hatθ}^\top x$ with an independent centered Gaussian variable of variance $\text{Tr}(C_{\hatθ}\mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at $μ_{\hatθ}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.


Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification

Zhu, Xiaohan, Ohannessian, Mesrob I., Srebro, Nathan

arXiv.org Machine Learning

We consider a PAC-Bayes type learning rule for binary classification, balancing the training error of a randomized ''posterior'' predictor with its KL divergence to a pre-specified ''prior''. This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule, to continuous priors and randomized predictions. With a balancing parameter of $λ=1$ this learning rule recovers an (empirical) Bayes posterior and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead a choice of $λ\gg 1$, which can be seen as using a sample-size-dependent-prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter $λ$, understanding the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025] that considered only discrete priors to PAC Bayes type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.


Active Seriation: Efficient Ordering Recovery with Statistical Guarantees

Cheshire, James, Issartel, Yann

arXiv.org Machine Learning

Active seriation aims at recovering an unknown ordering of $n$ items by adaptively querying pairwise similarities. The observations are noisy measurements of entries of an underlying $n$ x $n$ permuted Robinson matrix, whose permutation encodes the latent ordering. The framework allows the algorithm to start with partial information on the latent ordering, including seriation from scratch as a special case. We propose an active seriation algorithm that provably recovers the latent ordering with high probability. Under a uniform separation condition on the similarity matrix, optimal performance guarantees are established, both in terms of the probability of error and the number of observations required for successful recovery.


Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Miñoza, Jose Marie Antonio, Medina, Paulo Mario P., Ibañez, Sebastian C.

arXiv.org Machine Learning

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width $m = Ω(κ^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6--9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.


TowardtheFundamentalLimitsofImitation Learning

Neural Information Processing Systems

We then propose a novel algorithm based on minimum-distance functionals in the setting where the transition model is given and the expert is deterministic.Thealgorithmissuboptimalby .|S|H3/2/N,matchingourlower


How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Neural Information Processing Systems

However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when f(x) is convex. If f(x) is convex, to find a point with gradient norm ε, we design an algorithm SGD3withanear-optimalrate eO(ε 2),improvingthebestknownrateO(ε 8/3) of [17].


CategorizedBandits

Neural Information Processing Systems

In the multi-armed bandit problem, an agent has several possible decisions, usually referred to as "arms", and chooses or "pulls" sequentially one of them at each time step. This generates a sequence of rewards and the objective is to maximize their cumulative sum.