AITopics

2605.23061

Country: North America > United States (0.92)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Mustafi, Aratrika, Mukherjee, Soumya, Sriperumbudur, Bharath K.

Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer

arXiv.org Machine LearningMay-25-2026

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.

artificial intelligence, assumption, machine learning, (18 more...)

2605.23871

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine LearningMay-13-2026

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Shi, Kexuan, Li, Hanxuan, Qiu, Zeju, Wen, Yandong, Buchholz, Simon, Liu, Weiyang

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2605.12492

Country: Asia (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Parshakova, Tetiana, Khaled, Ahmed, Crawshaw, Michael, Garrigos, Guillaume, Gower, Robert M.

Muon Does Not Converge on Convex Lipschitz Functions

arXiv.org Machine LearningMay-12-2026

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.

artificial intelligence, machine learning, muon, (16 more...)

2605.0898

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Regis, Eric, Chewi, Sinho

A Rod Flow Model for Adam at the Edge of Stability

arXiv.org Machine LearningMay-11-2026

Neural networks are trained by minimizing loss functions with gradient-based optimizers. Cohen et al. [2021] observed that full-batch gradient descent operates at the edge of stability (EoS): the largest eigenvalue of the Hessian, called the sharpness, first rises (a phase called progressive sharpening) and then hovers at the stability threshold 2/η where η is the learning rate. Cohen et al. [2022] extended this picture to momentum methods and adaptive gradient methods, showing that each optimizer exhibits its own edge of stability. Rather than hovering at 2/η, the relevant quantity--the preconditioned sharpness--hovers at a hyperparameter-dependent threshold that depends on the optimizer (Table 2). In practice, the dominant optimizer in machine learning is Adam [Kingma and Ba, 2015], which differs from gradient descent in two respects.

artificial intelligence, equation, machine learning, (15 more...)

2605.06821

Country: North America (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Raviola, Matteo, Peherstorfer, Benjamin

A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions

arXiv.org Machine LearningMay-4-2026

Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics non-unique. We interpret this non-uniqueness as a gauge freedom: nullspace directions that leave the time derivative unchanged can be used to select better-conditioned parameter velocities. Building on Onsager's minimum-dissipation principle, we introduce a history variable -- interpretable as momentum -- and inject it only along the nullspace directions. The resulting Dirac-Frenkel-Onsager dynamics preserve instantaneous residual minimization, in contrast to standard regularization that can introduce bias, while promoting temporally smooth parameter evolutions. Examples demonstrate that the approach leads to increased robustness in singular and near-singular regimes.

artificial intelligence, machine learning, parameter velocity, (18 more...)

2605.00284

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsMay-1-2026, 02:03:57 GMT

2dd7f33ffbb59b4ff987be5442a13016-Paper-Conference.pdf

artificial intelligence, machine learning, particle, (18 more...)

Country:

North America > United States (0.46)
Europe (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsApr-30-2026, 06:39:55 GMT

Momentum Provably Improves Error Feedback!

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Country: Asia (0.27)

Genre: Research Report (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

Neural Information Processing SystemsApr-29-2026, 19:50:12 GMT

ce26d21662c979d515164b416d4571fe-Paper-Conference.pdf

artificial intelligence, machine learning, meta-adam, (18 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Neural Information Processing SystemsApr-25-2026, 21:27:09 GMT

Appendix AProofs

The proof follows from the following equality and the fact that Zγ is independent of q(z). All experiments are run on Nvidia GPUs. The exact softwares can be found in the supplemental code. The'letter' split of the EMNIST dataset was used as the auxiliary dataset. The images are resized to are 32x32.

accuracy, artificial intelligence, machine learning, (15 more...)

Industry: Health & Medicine > Therapeutic Area (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)