manifold
Neural Hamiltonian Diffusions for Modeling Structured Geometric Dynamics
We introduce Neural Hamiltonian Diffusion, a unified framework for learning stochastic Hamiltonian dynamics on differentiable manifolds. While Hamiltonian Neural Networks (HNNs) model conservative systems in flat Euclidean space, they fail to account for geometric structure and intrinsic stochasticity. Conversely, diffusion models on Riemannian manifolds offer geometry-aware stochastic modeling but lack physical inductive biases. Our method parameterizes a Hamiltonian with a neural network and defines its dynamics as a stochastic differential equation on a (pseudo-)Riemannian manifold equipped with a Poisson structure. This enables physically consistent modeling of dynamics on curved, periodic, or causally structured spaces. We demonstrate that the proposed geometric dynamics generalizes existing approaches and applies to systems ranging from molecular dynamics to relativistic n-body problems.
Adaptive Riemannian ADMM for Nonsmooth Optimization: Optimal Complexity without Smoothing
We study the problem of minimizing the sum of a smooth function and a nonsmooth convex regularizer over a compact Riemannian submanifold embedded in Euclidean space. By introducing an auxiliary splitting variable, we propose an adaptive Riemannian alternating direction method of multipliers (ARADMM), which, for the first time, achieves convergence without requiring smoothing of the nonsmooth term. Our approach involves only one Riemannian gradient evaluation and one proximal update per iteration. Through careful and adaptive coordination of the stepsizes and penalty parameters, we establish an optimal iteration complexity of order O(ϵ 3) for finding an ϵ-approximate KKT point, matching the complexity of existing smoothing technique-based Riemannian ADMM methods. Extensive numerical experiments on sparse PCA and robust subspace recovery demonstrate that our ARADMM consistently outperforms state-of-the-art Riemannian ADMM variants in convergence speed and solution quality.
b2c4b7d34b3d96b9dc12f7bce424b7ae-Paper-Conference.pdf
Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, tr(H), whereas we prove that Adam minimizes tr(Diag(H)1/2) instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function
Modern machine learning algorithms, especially deep learning-based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search-based approaches to automating this laborious and compute-intensive task, the fundamental learning-theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data-driven setting. We assume that we have a series of learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter; our analysis relies on subtle concepts including tools from algebraic geometry, differential geometry and constrained optimization. We use this to show that the learning-theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications--tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.
Connecting Neural Models Latent Geometries with Relative Geodesic Representations
Neural models learn representations of high-dimensional data on low-dimensional manifolds. Multiple factors, including stochasticities in the training process, model architectures, and additional inductive biases, may induce different representations, even when learning the same task on the same data. However, it has recently been shown that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. Building on this idea, we demonstrate that exploiting the differential-geometric structure of latent spaces of neural models, it is possible to capture precisely the transformations between representational spaces trained on similar data distributions. Specifically, we assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric that captures the intrinsic structure of the latent space, while scaling efficiently to large models.
Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity
Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold cases in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, in this paper we investigate direct sampling of the Euclidean diffusion models for general manifold-structured data. We reveal the multiscale singularity of the score function in the ambient space, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by decomposing it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which reduces the scale discrepancies in the score function by utilizing a nonisotropic noise, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.
Riemannian Proximal Sampler for High-accuracy Sampling on Manifolds
We introduce the Riemannian Proximal Sampler, a method for sampling from densities defined on Riemannian manifolds. The performance of this sampler critically depends on two key oracles: the Manifold Brownian Increments (MBI) oracle and the Riemannian Heat-kernel (RHK) oracle. We establish high-accuracy sampling guarantees for the Riemannian Proximal Sampler, showing that generating samples with ε-accuracy requires O(log(1/ε)) iterations in Kullback-Leibler divergence assuming access to exact oracles and O(log2(1/ε))iterations in the total variation metric assuming access to sufficiently accurate inexact oracles.
Continuous Diffusion Model for Language Modeling
Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. However, diffusion models that directly work on discrete data space fail to fully exploit the power of iterative refinement, as the signals are lost during transitions between discrete states. Existing continuous diffusion models for discrete data underperform compared to discrete methods, and the lack of a clear connection between the two approaches hinders the development of effective diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on this analogy, introduce a simple diffusion process that generalizes existing discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, along with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. The code is available at https://github.com/harryjo97/RDLM.