BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Simon Guo
Neural Information Processing Systems
The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch at large scale is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE: the dense models' feed-forward network (FFN) layers initialize the MoE's experts, while the remaining parameters are merged. However, this approach limits the reuse of dense model parameters to only the FFN layers, constraining the benefits of "upcycling" these models into MoEs.
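For intuition, here is a minimal sketch of the standard upcycling scheme the abstract describes: each dense checkpoint's FFN weights become one MoE expert, and all remaining parameters are merged (here, by simple averaging). This is not the paper's implementation; the parameter naming convention (`is_ffn_param`), the `upcycle_to_moe` helper, and the averaging rule are illustrative assumptions.

```python
import numpy as np


def is_ffn_param(name: str) -> bool:
    """Heuristic assumption: parameters under an 'ffn.' key are feed-forward weights."""
    return ".ffn." in name or name.startswith("ffn.")


def upcycle_to_moe(dense_models: list[dict[str, np.ndarray]]) -> dict:
    """Build an MoE-style parameter dict from several dense checkpoints."""
    moe = {"experts": [], "shared": {}}

    # 1) Each dense model's FFN parameters initialize one MoE expert.
    for state in dense_models:
        expert = {k: v.copy() for k, v in state.items() if is_ffn_param(k)}
        moe["experts"].append(expert)

    # 2) All non-FFN parameters (attention, embeddings, norms) are merged
    #    across the dense models -- here via a plain element-wise average.
    shared_keys = [k for k in dense_models[0] if not is_ffn_param(k)]
    for k in shared_keys:
        moe["shared"][k] = np.mean([state[k] for state in dense_models], axis=0)
    return moe


# Toy usage: two "dense models", each with one attention matrix and one FFN matrix.
rng = np.random.default_rng(0)
dense_models = [
    {"attn.w": rng.normal(size=(4, 4)), "ffn.w": rng.normal(size=(4, 8))}
    for _ in range(2)
]
moe = upcycle_to_moe(dense_models)
assert len(moe["experts"]) == 2 and "attn.w" in moe["shared"]
```

Note how only the FFN weights survive as distinct experts; everything else collapses into a single shared set, which is exactly the limitation the abstract points out.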