Goto

Collaborating Authors

 curvature


The Origin of Edge of Stability

Litman, Elon

arXiv.org Machine Learning

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/η$, where $η$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/η$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/η$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/η$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.


Horospherical Depth and Busemann Median on Hadamard Manifolds

Jiang, Yangdi, Chang, Xiaotian, Mostajeran, Cyrus

arXiv.org Machine Learning

\We introduce the horospherical depth, an intrinsic notion of statistical depth on Hadamard manifolds, and define the Busemann median as the set of its maximizers. The construction exploits the fact that the linear functionals appearing in Tukey's half-space depth are themselves limits of renormalized distance functions; on a Hadamard manifold the same limiting procedure produces Busemann functions, whose sublevel sets are horoballs, the intrinsic replacements for halfspaces. The resulting depth is parametrized by the visual boundary, is isometry-equivariant, and requires neither tangent-space linearization nor a chosen base point.For arbitrary Hadamard manifolds, we prove that the depth regions are nested and geodesically convex, that a centerpoint of depth at least $1/(d+1)$ exists, and hence that the Busemann median exists for every Borel probability measure. Under strictly negative sectional curvature and mild regularity assumptions, the depth is strictly quasi-concave and the median is unique. We also establish robustness: the depth is stable under total-variation perturbations, and under contamination escaping to infinity the limiting median depends on the escape direction but not on how far the contaminating mass has moved along the geodesic ray, in contrast with the Fréchet mean. Finally, we establish uniform consistency of the sample depth and convergence of sample depth regions and sample Busemann medians; on symmetric spaces of noncompact type, the argument proceeds through a VC analysis of upper horospherical halfspaces, while on general Hadamard manifolds it follows from a compactness argument under a mild non-atomicity assumption.


Gating Enables Curvature: A Geometric Expressivity Gap in Attention

Bathula, Satwik, Joshi, Anand A.

arXiv.org Machine Learning

Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite the success of gated attention, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher--Rao geometry. We show that ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.


Robustness of classifiers: from adversarial to random noise

Neural Information Processing Systems

Several recent works have shown that state-of-the-art classifiers are vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints. On the other hand, it has been empirically observed that these same classifiers are relatively robust to random noise. In this paper, we propose to study a semi-random noise regime that generalizes both the random and worst-case noise regimes. We propose the first quantitative analysis of the robustness of nonlinear classifiers in this general noise regime. We establish precise theoretical bounds on the robustness of classifiers in this general regime, which depend on the curvature of the classifier's decision boundary. Our bounds confirm and quantify the empirical observations that classifiers satisfying curvature constraints are robust to random noise. Moreover, we quantify the robustness of classifiers in terms of the subspace dimension in the semi-random noise regime, and show that our bounds remarkably interpolate between the worst-case and random noise regimes. We perform experiments and show that the derived bounds provide very accurate estimates when applied to various state-of-the-art deep neural networks and datasets. This result suggests bounds on the curvature of the classifiers' decision boundaries that we support experimentally, and more generally offers important insights onto the geometry of high dimensional classification problems.


Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

Neural Information Processing Systems

In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also the method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2-to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods.


IQP Born Machines under Data-dependent and Agnostic Initialization Strategies

Lerch, Sacha, Bowles, Joseph, Puig, Ricard, Armengol, Erik, Holmes, Zoë, Thanasilp, Supanut

arXiv.org Machine Learning

Quantum circuit Born machines based on instantaneous quantum polynomial-time (IQP) circuits are natural candidates for quantum generative modeling, both because of their probabilistic structure and because IQP sampling is provably classically hard in certain regimes. Recent proposals focus on training IQP-QCBMs using Maximum Mean Discrepancy (MMD) losses built from low-body Pauli-$Z$ correlators, but the effect of initialization on the resulting optimization landscape remains poorly understood. In this work, we address this by first proving that the MMD loss landscape suffers from barren plateaus for random full-angle-range initializations of IQP circuits. We then establish lower bounds on the loss variance for identity and an unbiased data-agnostic initialization. We then additionally consider a data-dependent initialization that is better aligned with the target distribution and, under suitable assumptions, yields provable gradients and generally converges quicker to a good minimum (as indicated by our training of circuits with 150 qubits on genomic data). Finally, as a by-product, the developed variance lower bound framework is applicable to a general class of non-linear losses, offering a broader toolset for analyzing warm-starts in quantum machine learning.


The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

Neural Information Processing Systems

An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent. While the efficiency of such methods depends crucially on the local curvature of the loss surface, very little is actually known about how this geometry depends on network architecture and hyperparameters. In this work, we extend a recently-developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix. We focus on a single-hidden-layer neural network with Gaussian data and weights and provide an exact expression for the spectrum in the limit of infinite width. We find that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate. We also predict and demonstrate empirically that by adjusting the nonlinearity, the spectrum can be tuned so as to improve the efficiency of first-order optimization methods.