$μ$-Parametrization for Mixture of Experts

Małaśnicki, Jan, Ciebiera, Kamil, Boruń, Mateusz, Pióro, Maciej, Ludziejewski, Jan, Stefaniak, Maciej, Krutul, Michał, Jaszczur, Sebastian, Cygan, Marek, Adamczewski, Kamil, Krajewski, Jakub

Oct-10-2025–arXiv.org Artificial Intelligence

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. The largest open-source models now exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive, which is why $μ$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred across model scales, substantially reducing tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
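For intuition about what transferring a learning rate across widths means, the sketch below applies the commonly cited dense-model $μ$P rule for Adam: the learning rate of hidden weight matrices shrinks in proportion to the width multiplier, while input-like parameters keep the base value. This is not the MoE parametrization derived in this work; the function name `mup_adam_lr`, the two parameter kinds, and the example widths are assumptions made purely for illustration.

```python
def mup_adam_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Rescale a learning rate tuned at `base_width` for a model of `width`.

    Assumption: dense-model muP rule for Adam, where hidden matrices use
    base_lr / width_mult and input-like parameters keep base_lr.
    This is an illustrative helper, not the rule derived for MoE layers.
    """
    if kind not in ("input", "hidden"):
        raise ValueError(f"unknown parameter kind: {kind!r}")
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")

    width_mult = width / base_width  # how much wider the target model is
    if kind == "input":
        return base_lr               # input weights and biases: unchanged
    return base_lr / width_mult      # hidden matrices: scaled down with width


# Hypothetical example: tune at width 256, then reuse the result at width 1024.
small_lr = 1e-2
hidden_lr = mup_adam_lr(small_lr, base_width=256, width=1024, kind="hidden")
input_lr = mup_adam_lr(small_lr, base_width=256, width=1024, kind="input")
```

Under these assumptions, only one sweep on the small proxy model is needed; the wider model reuses the tuned value after rescaling rather than being tuned again.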

large language model, machine learning, natural language
