Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast
Cheng Yang, Jiahao Wang
– Neural Information Processing Systems
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. In MoE, each token in the input sequence activates a different subset of experts, determined by a routing mechanism. However, the unchosen experts in MoE models do not contribute to the output, potentially leaving part of the model's capacity unused. In this work, we first conduct exploratory studies showing that increasing the number of activated experts does not necessarily improve output quality and can even degrade it. We then show that the output distributions produced by an MoE model under different routing strategies differ substantially, indicating that different experts do not always act synergistically.
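To make the routing setup concrete, below is a minimal, hypothetical PyTorch sketch of top-k expert routing and of how one might compare next-token distributions under two routing strategies (e.g., top-2 versus activating all experts). The `TopKMoELayer` class, the layer sizes, and the KL-divergence comparison are illustrative assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class TopKMoELayer(torch.nn.Module):
    """Toy MoE feed-forward layer: a linear router scores all experts and
    only the top-k experts per token contribute to the output."""
    def __init__(self, d_model: int, n_experts: int = 8):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor, k: int = 2) -> torch.Tensor:
        # x: (n_tokens, d_model); gate: routing probabilities over experts
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(k, dim=-1)                    # keep k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Compare next-token distributions under two routing strategies (top-2 vs. all experts).
torch.manual_seed(0)
d_model, vocab = 64, 100
layer = TopKMoELayer(d_model)
to_vocab = torch.nn.Linear(d_model, vocab)   # stand-in for the LM head
x = torch.randn(4, d_model)                  # 4 dummy token representations

log_p_topk = F.log_softmax(to_vocab(layer(x, k=2)), dim=-1)
log_p_dense = F.log_softmax(to_vocab(layer(x, k=8)), dim=-1)
kl = F.kl_div(log_p_dense, log_p_topk, log_target=True, reduction="batchmean")
print(f"KL(top-2 routing || all-experts routing) per token: {kl.item():.4f}")
```

In this toy setting the divergence merely illustrates the kind of measurement involved; the abstract's claim is that, in trained MoE models, such distributional gaps between routing strategies are large enough to exploit.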