Multi-Head Mixture-of-Experts

May-27-2025, 12:13:20 GMT–Neural Information Processing Systems

Sparse Mixture-of-Experts (SMoE) exhibits a low expert activation issue, i.e., only a small subset of experts is activated for optimization, which leads to suboptimal performance and limits its effectiveness in learning a larger number of experts in complex tasks. In this paper, we propose Multi-Head Mixture-of-Experts (MH-MoE). MH-MoE splits each input token into multiple sub-tokens; these sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. These operations enable MH-MoE to significantly enhance expert activation while collectively attending to information from various representation spaces within different experts to deepen context understanding. Moreover, MH-MoE is straightforward to implement and decoupled from other SMoE frameworks, making it easy to integrate with these frameworks for enhanced performance.
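The split, route, and merge steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `mh_moe_forward`, the top-1 softmax router, the single-matrix `tanh` experts, and all shapes are assumptions chosen for illustration, and any learned projections applied before splitting or after merging are omitted.

```python
import numpy as np

def mh_moe_forward(x, num_heads, router_w, expert_ws):
    """Illustrative split -> route -> merge pass (hypothetical, not the paper's code).

    x:         (n_tokens, d_model) input tokens
    router_w:  (d_sub, n_experts) router weights, d_sub = d_model // num_heads
    expert_ws: (n_experts, d_sub, d_sub) one weight matrix per expert (assumed form)
    """
    n_tokens, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_sub = d_model // num_heads

    # Split: every token becomes num_heads sub-tokens of width d_sub.
    sub = x.reshape(n_tokens * num_heads, d_sub)

    # Route: softmax over experts, then top-1 selection (an assumption here).
    logits = sub @ router_w
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)
    gate = probs[np.arange(sub.shape[0]), choice]

    # Process: each expert transforms only the sub-tokens routed to it.
    out = np.zeros_like(sub, dtype=float)
    for e in range(expert_ws.shape[0]):
        mask = choice == e
        if mask.any():
            out[mask] = gate[mask, None] * np.tanh(sub[mask] @ expert_ws[e])

    # Merge: sub-tokens are reassembled into the original token shape.
    return out.reshape(n_tokens, d_model), choice

# Usage with arbitrary sizes: 4 tokens, d_model=8, 4 heads (d_sub=2), 6 experts.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(2, 6))
expert_ws = rng.normal(size=(6, 2, 2))
y, choice = mh_moe_forward(x, 4, router_w, expert_ws)
print(y.shape, choice.shape)  # (4, 8) (16,)
```

Because routing happens per sub-token, 4 tokens produce 16 routing decisions in this example, which is the intuition behind the higher expert activation the abstract describes.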

artificial intelligence, mh-moe, multi-head mixture-of-expert, (2 more...)
