Goto

Collaborating Authors

 lipschitz


Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer

arXiv.org Machine Learning

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.


Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories

arXiv.org Machine Learning

Medium-horizon Alzheimer's disease progression prediction is difficult because future clinical scores can remain tied to baseline severity, while biomarker histories are irregular and incompletely observed. We develop an anchor-based analysis of 24-month Clinical Dementia Rating Sum of Boxes (CDR-SB) change using harmonized Alzheimer's Disease Neuroimaging Initiative (ADNI) tables. Each labeled sample is anchored at a mild cognitive impairment visit, uses only clinical and biomarker history observed at or before that anchor, and defines the response as CDR-SB at the future visit closest to 24 months within an 18--30 month window minus anchor CDR-SB. The analytic cohort contains 2,600 labeled anchors from 858 participants and 7,276 longitudinal rows. We propose a residual gap-aware transformer that combines a mixed-effects statistical reference with transformer-based residual learning from pre-anchor clinical and biomarker histories. The model uses participant-level random intercepts in the mixed-effects reference, observation-level triplet tokenization for irregular histories, and a learned nonnegative time-gap penalty inside self-attention. We compare the proposed model with a Bayesian-information-criterion-selected linear mixed-effects baseline, GRU-D, and STraTS under repeated participant-level train--test splits. Across five participant-level random seeds, the proposed model achieves the best mean test performance across all reported metrics, reducing MSE by 13.1% and increasing prediction--observation correlation by 26.4% relative to the mixed-effects baseline. It also improves over both GRU-D and STraTS in mean error and correlation. These results show that statistical anchoring and gap-aware residual learning provide a useful structure for medium-horizon Alzheimer's disease progression prediction.


Identifiability and Stability of Generative Drifting with Companion-Elliptic Kernel Families

arXiv.org Machine Learning

This paper studies the identifiability and stability of drifting fields within the framework of Generative Modeling via Drifting. The motivating question is whether a zero-drift equilibrium identifies the target distribution, and whether an approximate zero drift implies weak distributional convergence. Since the original drifting model employs the Laplace kernel by default, we first analyze why standard Gaussian score-based arguments fail to apply. This analysis motivates the introduction of companion-elliptic kernel families, which are characterized by a companion potential satisfying an elliptic closure relation. We show that this class naturally contains the Laplace kernel and consists precisely of Gaussian and Matérn kernels with smoothness parameter $ν\ge 1/2$. Within this class, we establish field identifiability for arbitrary Borel probability measures on $\mathbb{R}^d$: if the drifting field vanishes identically, then the two measures must coincide. As for stability, we demonstrate that field convergence alone does not guarantee weak convergence, since mass may escape to infinity while remaining invisible to the field. Although tightness of the sequence directly removes this obstruction and restores weak stability, we prove that, even without tightness, every $C_0$-vague cluster point lies exactly on the defect ray $\{cp:0\le c\le1\}$. Consequently, a single scalar $C_0$-observable suffices to detect the missing mass and recover weak convergence.


TopoFisher: Learning Topological Summary Statistics by Maximizing Fisher Information

arXiv.org Machine Learning

Persistence diagrams provide stable, interpretable summaries of geometric and topological structure and are useful for simulation-based inference when low-order statistics miss key information. Yet persistence-based pipelines require hand-chosen filtrations, vectorizations, and compressors, typically without an objective tied to parameter uncertainty. We introduce \textbf{TopoFisher}, a differentiable persistent-homology pipeline that learns topological summaries by maximizing local Gaussian Fisher information. Using simulations near a fiducial parameter, TopoFisher optimizes trainable filtrations, diagram vectorizations, and compressors without posterior samples or supervised regression targets, while retaining stable topological inductive bias. We also give sufficient regularity conditions for the log-determinant Fisher loss to be locally Lipschitz in trainable parameters. Controlled experiments on noisy spirals and Gaussian random fields, where total Fisher information is known, show that TopoFisher recovers much of the available information and outperforms fixed topological vectorizations. Our main results are on weak gravitational lensing, a high-dimensional non-Gaussian cosmological field-inference problem. Learned topological summaries reach higher Fisher information than state-of-the-art cosmological summaries and approach an unconstrained Information Maximising Neural Network baseline with up to $\sim80\times$ fewer parameters. The learned filtrations also generalize better: under simulator shift from lognormal to LPT-based maps it retains most Fisher information, while the neural baseline drops, and in neural posterior estimation they give tighter constraints than the neural baseline, and of state-of-the-art cosmological summaries. These results support Fisher-based topological optimization as a robust, parameter-efficient front end for simulation-based inference.


Spectral Graph Sparsification Preserves Representation Geometry in Graph Neural Networks

arXiv.org Machine Learning

Spectral graph sparsification is a classical tool for reducing graph complexity while preserving Laplacian quadratic forms. In graph neural networks (GNNs), sparsification is often used to accelerate computation while maintaining predictive performance. In this work, we study a complementary representation-level question: does sparsification preserve the geometry of learned embeddings? For polynomial-filter GNNs, we prove that any $ε$-spectral sparsifier induces $O(ε)$ perturbations in polynomial graph filters, multilayer hidden representations, and their Gram matrices. These guarantees imply stability of squared pairwise distances, class means, and covariance structure in embedding space. We further establish finite-time training stability: under smoothness and boundedness assumptions, gradient descent on dense and sparsified graphs produces weight trajectories whose separation grows at most proportionally to the sparsification distortion. Empirically, effective-resistance sparsification validates the predicted perturbation chain on synthetic graphs and preserves hidden representation geometry on real datasets. In our experiments, the gram matrix and training dynamics show low divergence even under substantial sparsification, consistent with the predicted stability under spectral sparsification. Hidden Gram preservation strongly predicts neighborhood preservation and class-centroid stability across FashionMNIST, Cora, and Paul15. Together, these results show that spectral sparsification preserves not only graph operators, but also the representation geometry that supports downstream use of GNN embeddings for interpretability.


Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

arXiv.org Machine Learning

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap'', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.



On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning

Neural Information Processing Systems

We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy using data from several tasks represented by Markov Decision Processes (MDPs) that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in MAML update steps is crucial for RL problems since computation of exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SGMRL to find an -first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.


Lipschitz regularity in Flow Matching and Diffusion Models: sharp sampling rates and functional inequalities

arXiv.org Machine Learning

Under general assumptions on the target distribution $p^\star$, we establish a sharp Lipschitz regularity theory for flow-matching vector fields and diffusion-model scores, with optimal dependence on time and dimension. As applications, we obtain Wasserstein discretization bounds for Euler-type samplers in dimension $d$: with $N$ discretization steps, the error achieves the optimal rate $\sqrt{d}/N$ up to logarithmic factors. Moreover, the constants do not deteriorate exponentially with the spatial extent of $p^\star$. We also show that the one-sided Lipschitz control yields a globally Lipschitz transport map from the standard Gaussian to $p^\star$, which implies Poincaré and log-Sobolev inequalities for a broad class of probability measures.