$μ$-Parametrization for Mixture of Experts

Małaśnicki, Jan, Ciebiera, Kamil, Boruń, Mateusz, Pióro, Maciej, Ludziejewski, Jan, Stefaniak, Maciej, Krutul, Michał, Jaszczur, Sebastian, Cygan, Marek, Adamczewski, Kamil, Krajewski, Jakub

arXiv.org Artificial Intelligence 

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. The largest open-source models now exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. For precisely this reason, $μ$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred across model scales, greatly reducing tuning costs. However, existing work has focused primarily on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parametrization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
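To make the idea of learning-rate transfer concrete, the following is a minimal sketch of the width-scaling rule commonly used for hidden weight matrices in dense models trained with Adam under $μ$P. It is not the MoE parametrization derived in this work, which the abstract does not spell out. The function name `mup_hidden_lr` and the example widths are hypothetical and chosen only for illustration.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Scale a learning rate tuned at base_width to a model of a different width.

    Assumption: the dense-model muP rule for hidden (matrix-like) weights
    under Adam, where the learning rate shrinks in proportion to 1 / width.
    This is NOT the MoE-specific rule from the paper.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * (base_width / width)


# Tune once on a small proxy model, then reuse the result at larger widths.
tuned_lr = 1e-2       # hypothetical optimum found at width 256
proxy_width = 256
for target_width in (256, 1024, 4096):
    lr = mup_hidden_lr(tuned_lr, proxy_width, target_width)
    print(f"width={target_width:5d}  hidden-weight lr={lr:.6g}")
```

The point of the sketch is that only the small proxy model is swept over learning rates; larger models inherit the tuned value through a fixed scaling rule rather than through a new search.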
