Practical Efficiency of Muon for Pretraining

Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani

May-21-2025–arXiv.org Machine Learning

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
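For readers unfamiliar with the optimizer, the following is a minimal sketch of the kind of update Muon is generally described as performing: accumulate momentum for a weight matrix, approximately orthogonalize it with a Newton-Schulz iteration, and step along the result. It is an illustration under stated assumptions, not the authors' implementation. In particular, it uses the classic cubic Newton-Schulz iteration rather than the tuned polynomial found in public Muon implementations, it omits Nesterov momentum and any shape-dependent scaling, and the function names and default hyperparameters (`orthogonalize`, `muon_step`, `lr`, `beta`, `steps`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30, eps=1e-7):
    """Approximate the orthogonal polar factor of g.

    Assumption: uses the classic cubic Newton-Schulz iteration
    X <- 1.5 * X - 0.5 * X X^T X, which drives the singular values
    of a full-rank, norm-scaled matrix toward 1.
    """
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxtx)
        ]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style step for a single 2-D weight matrix.

    Assumption: plain heavy-ball momentum, no Nesterov correction,
    no weight decay, no per-shape learning-rate scaling.
    """
    # Accumulate momentum from the current gradient.
    momentum = [
        [beta * m + gv for m, gv in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum, grad)
    ]
    # Replace the raw momentum with its approximately orthogonal factor.
    update = orthogonalize(momentum)
    # Take a gradient-descent-style step along the orthogonalized direction.
    weight = [
        [w - lr * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]
    return weight, momentum


if __name__ == "__main__":
    w = [[0.0, 0.0], [0.0, 0.0]]
    g = [[3.0, 1.0], [1.0, 2.0]]
    m = [[0.0, 0.0], [0.0, 0.0]]
    w, m = muon_step(w, g, m)
    print(w)
```

Because the orthogonalized update has all singular values near 1, the step size along every direction of the momentum matrix is roughly equal, regardless of how unevenly the raw gradient is scaled. Practical implementations typically run this on accelerators with far fewer iterations and apply it only to 2-D hidden-layer weights.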

large language model, machine learning, natural language

Country:
- North America > United States
  - California > San Francisco County > San Francisco (0.14)
- Asia > Middle East
  - Jordan (0.05)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
