Towards an empirical understanding of MoE design choices
Dongyang Fan, Bettina Messmer, Martin Jaggi
arXiv.org Artificial Intelligence
Mixture of Experts (MoE) models have been receiving unprecedented attention in the LLM era. While MoE was initially proposed by Jacobs et al. (1991) to encourage expert specialization when a model is under-parameterized relative to the full data domain, contemporary practice (Shazeer et al., 2017; Fedus et al., 2022) does not specifically seek expert specialization; instead, it uses MoE as a tool to scale up model expressiveness at a reduced inference cost. A study by Zoph et al. (2022a) revealed the existence of expert specialization in encoder blocks, particularly at the lexical level. Furthermore, the recent Mixtral paper by Jiang et al. (2024) provides evidence that the router exhibits structured syntactic behavior rather than topic-level understanding. We posit that the cultivation of fine-grained expert specialization is facilitated by token-level routing mechanisms.
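To make "token-level routing" concrete, below is a minimal sketch of top-k token-level routing in the style of Shazeer et al. (2017), written in PyTorch for illustration. The function name, tensor shapes, and hyperparameters are assumptions chosen for the example, not details taken from the paper.

```python
# Illustrative sketch only: each token is routed independently to its
# top-k experts, in contrast to routing at the sequence or topic level.
import torch
import torch.nn.functional as F

def token_level_route(x, router_weight, k=2):
    """Route each token independently to its top-k experts.

    x:             (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) learned router projection
    k:             number of experts selected per token
    Returns per-token expert indices and renormalized gate weights.
    """
    logits = x @ router_weight                      # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)               # routing distribution per token
    gate_vals, expert_idx = probs.topk(k, dim=-1)   # top-k experts for each token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize gates
    return expert_idx, gate_vals

# Example: 4 tokens, hidden size 8, 4 experts, top-2 routing (all sizes assumed)
x = torch.randn(4, 8)
router_weight = torch.randn(8, 4)
expert_idx, gates = token_level_route(x, router_weight)
print(expert_idx.shape, gates.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Because the routing decision is made per token rather than per sequence, individual experts can specialize on fine-grained, lexical-level patterns, which is the specialization behavior the abstract refers to.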
Feb-20-2024