Convergence Bound and Critical Batch Size of Muon Optimizer

Sato, Naoki, Naganuma, Hiroki, Iiduka, Hideaki

Nov-24-2025–arXiv.org Artificial Intelligence

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling task.

artificial intelligence, batch size, machine learning, (18 more...)

arXiv.org Artificial Intelligence

Nov-24-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.81)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found