BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Simon Guo
Neural Information Processing Systems
The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch at large scale is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE: the dense models' feed-forward network (FFN) layers initialize the MoE's experts, while the remaining parameters are merged. However, this approach limits the reuse of dense model parameters to only the FFN layers, constraining the benefits of "upcycling" these models into MoEs.
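For intuition, here is a minimal sketch of the standard upcycling scheme the abstract describes: each dense checkpoint's FFN weights become one MoE expert, and all remaining parameters are merged (here, by simple averaging). This is not the paper's implementation; the parameter naming convention (`is_ffn_param`), the `upcycle_to_moe` helper, and the averaging rule are illustrative assumptions.

```python
import numpy as np


def is_ffn_param(name: str) -> bool:
    """Heuristic assumption: parameters under an 'ffn.' key are feed-forward weights."""
    return ".ffn." in name or name.startswith("ffn.")


def upcycle_to_moe(dense_models: list[dict[str, np.ndarray]]) -> dict:
    """Build an MoE-style parameter dict from several dense checkpoints."""
    moe = {"experts": [], "shared": {}}

    # 1) Each dense model's FFN parameters initialize one MoE expert.
    for state in dense_models:
        expert = {k: v.copy() for k, v in state.items() if is_ffn_param(k)}
        moe["experts"].append(expert)

    # 2) All non-FFN parameters (attention, embeddings, norms) are merged
    #    across the dense models -- here via a plain element-wise average.
    shared_keys = [k for k in dense_models[0] if not is_ffn_param(k)]
    for k in shared_keys:
        moe["shared"][k] = np.mean([state[k] for state in dense_models], axis=0)
    return moe


# Toy usage: two "dense models", each with one attention matrix and one FFN matrix.
rng = np.random.default_rng(0)
dense_models = [
    {"attn.w": rng.normal(size=(4, 4)), "ffn.w": rng.normal(size=(4, 8))}
    for _ in range(2)
]
moe = upcycle_to_moe(dense_models)
assert len(moe["experts"]) == 2 and "attn.w" in moe["shared"]
```

Note how only the FFN weights survive as distinct experts; everything else collapses into a single shared set, which is exactly the limitation the abstract points out.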