$μ$-Parametrization for Mixture of Experts
Małaśnicki, Jan, Ciebiera, Kamil, Boruń, Mateusz, Pióro, Maciej, Ludziejewski, Jan, Stefaniak, Maciej, Krutul, Michał, Jaszczur, Sebastian, Cygan, Marek, Adamczewski, Kamil, Krajewski, Jakub
arXiv.org Artificial Intelligence
Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. The largest open-source models now exceed $1$T parameters, a scale at which hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $μ$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred seamlessly across model scales, drastically reducing tuning costs. However, existing work has focused primarily on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parametrization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
Oct-10-2025
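The abstract's central claim follows the usual $μ$P pattern: if per-layer learning rates are scaled appropriately with the width multiplier, the optimum found on a small proxy model carries over to a much wider one. The sketch below illustrates that general recipe for a toy MoE block; the function name, the parameter groups, and the specific choice of which groups scale as $1/m$ are illustrative assumptions on my part, not the parameterization derived in the paper.

```python
# Minimal sketch (not the paper's released code): muP-style per-group Adam
# learning rates as a function of width, extended naively to MoE expert
# weights. Exact rules for the router and experts are exactly what the paper
# derives; the choices here are only illustrative.

def mup_lr_groups(base_lr: float, base_width: int, width: int) -> dict:
    """Return per-group Adam learning rates for a toy MoE transformer block.

    Assumes the common muP prescription for Adam: hidden (matrix-like)
    weights have their learning rate divided by the width multiplier m,
    while vector-like parameters keep the base learning rate.
    """
    m = width / base_width  # width multiplier relative to the tuned proxy
    return {
        "embedding": base_lr,          # input-like: unscaled under muP (Adam)
        "attention_qkv": base_lr / m,  # hidden matrices: LR ~ 1/m
        "expert_ffn": base_lr / m,     # each expert's FFN is also matrix-like
        "router": base_lr,             # width -> n_experts map; kept at the base
                                       # LR here as an illustrative choice, not a
                                       # claim about the paper's derivation
        "unembedding": base_lr / m,    # output-like: LR ~ 1/m under muP (Adam)
    }


if __name__ == "__main__":
    # Tune base_lr once at the proxy width, then reuse it at larger widths:
    # the groups that scale as 1/m shrink automatically as the model widens.
    for w in (256, 1024, 4096):
        print(w, mup_lr_groups(base_lr=3e-3, base_width=256, width=w))
```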