AITopics

2605.22795

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.87)

Mustafi, Aratrika, Mukherjee, Soumya, Sriperumbudur, Bharath K.

Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer

arXiv.org Machine LearningMay-25-2026

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.

artificial intelligence, assumption, machine learning, (18 more...)

2605.23871

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Hashimoto, Yuka, Iwata, Tomoharu

Unified generalization analysis for physics informed neural networks

arXiv.org Machine LearningMay-14-2026

Physics-Informed Neural Networks (PINNs) and their variational counterparts (VPINNs) are neural networks that incorporate physical laws, making them useful for scientific problems. Existing generalization analyses for PINNs and VPINNs remain limited, often requiring restrictive assumptions such as stability conditions or linear ellipticity. In this paper, we derive generalization bounds for neural networks that involve differentiation with respect to input variables, covering PINNs and VPINNs under a unified framework. We apply Taylor expansion to represent nonlinear differential operators as linear operators on a high-dimensional space, enabling the use of Koopman-based analysis and showing that high-rank networks can generalize well even in settings involving differential operators. We also show that the nonlinearity of the differential operator exponentially enlarges the bound, highlighting its significant impact on generalization.

artificial intelligence, machine learning, neural network, (19 more...)

2605.1326

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Gibson, William, Reisinger, Christoph

Uniform Scaling Limits in AdamW-Trained Transformers

arXiv.org Machine LearningMay-13-2026

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.

large language model, machine learning, natural language, (19 more...)

2605.11059

Country:

Europe > United Kingdom (0.28)
North America > United States > New York (0.27)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Arsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani

Order Optimal One-Shot Distributed Learning

Neural Information Processing SystemsApr-30-2026, 19:48:52 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (17 more...)

Country: Asia > Middle East (0.29)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Neural Information Processing SystemsApr-30-2026, 19:48:37 GMT

018b59ce1fd616d874afad0f44ba338d-AuthorFeedback.pdf

artificial intelligence, log mn, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.31)

Neural Information Processing SystemsApr-30-2026, 08:09:14 GMT

f5ccb3ab757131a93586ef61ec701533-Supplemental-Conference.pdf

In this section, we compare the symmetric solutions found in erf [2] and ReLU networks [5] to our one-neuron solution (n =1). The main difference is that both earlier studies constrain the search space to the symmetric subspace whereas we first prove that the non-trivial critical points are contained in this subspace in Theorem 5.1 for a broad class of activation functions, including erf and ReLU. Solving the low-dimensional loss, we recover the same solution for ReLU and erf as in [2, 5] for unit-orthonormal teachers.

artificial intelligence, critical point, machine learning, (19 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.37)

Neural Information Processing SystemsApr-30-2026, 08:09:11 GMT

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Any continuous function f can be approximated arbitrarily well by a neural network with sufficiently many neurons k. We consider the case when f itself is a neural network with one hidden layer and k neurons. Approximating f with a neural network with n < k neurons can thus be seen as fitting an under-parameterized "student" network with nneurons to a "teacher" network with k neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the n student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when n 1student neurons each copy one teacher neuron and the n-th student neuron averages the remaining k n+1 teacher neurons. For the student network with n = 1 neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.

activation function, artificial intelligence, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)