Towards an empirical understanding of MoE design choices
Dongyang Fan, Bettina Messmer, Martin Jaggi
arXiv.org Artificial Intelligence
Mixture of Experts (MoE) models have been receiving unprecedented attention in the LLM era. While MoE was initially proposed by Jacobs et al. (1991) to encourage expert specialization when a model is under-parameterized relative to the full data domain, contemporary practice (Shazeer et al., 2017; Fedus et al., 2022) does not specifically seek expert specialization; instead, it uses MoE as a tool to scale up model expressiveness at a reduced inference cost. A study by Zoph et al. (2022a) revealed the existence of expert specialization in encoder blocks, particularly at the lexical level. Furthermore, the recent Mixtral paper by Jiang et al. (2024) provides evidence that the router exhibits structured syntactic behavior rather than topic-level understanding. We posit that the cultivation of fine-grained expert specialization is facilitated by token-level routing mechanisms.
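To make "token-level routing" concrete, below is a minimal sketch of top-k token-level routing in the style of Shazeer et al. (2017), written in PyTorch for illustration. The function name, tensor shapes, and hyperparameters are assumptions chosen for the example, not details taken from the paper.

```python
# Illustrative sketch only: each token is routed independently to its
# top-k experts, in contrast to routing at the sequence or topic level.
import torch
import torch.nn.functional as F

def token_level_route(x, router_weight, k=2):
    """Route each token independently to its top-k experts.

    x:             (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) learned router projection
    k:             number of experts selected per token
    Returns per-token expert indices and renormalized gate weights.
    """
    logits = x @ router_weight                      # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)               # routing distribution per token
    gate_vals, expert_idx = probs.topk(k, dim=-1)   # top-k experts for each token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize gates
    return expert_idx, gate_vals

# Example: 4 tokens, hidden size 8, 4 experts, top-2 routing (all sizes assumed)
x = torch.randn(4, 8)
router_weight = torch.randn(8, 4)
expert_idx, gates = token_level_route(x, router_weight)
print(expert_idx.shape, gates.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Because the routing decision is made per token rather than per sequence, individual experts can specialize on fine-grained, lexical-level patterns, which is the specialization behavior the abstract refers to.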
Feb-20-2024