Unveiling Super Experts in Mixture-of-Experts Large Language Models
Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, Kehong Yuan
arXiv.org Artificial Intelligence
Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to enhance the efficiency of Mixture-of-Experts (MoE) large language models (LLMs). However, existing approaches often rely on empirical heuristics to identify critical experts, while lacking a deeper understanding of the heterogeneous importance of experts and the inner workings of MoE LLMs. In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes. We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge. In addition, we developed an automated tool for rapid and accurate SE profiling.

Sparsely activated Mixture-of-Experts (MoE) models employ dynamic routing and sparse activation, demonstrating significant potential in enhancing the learning capacity of large language models (LLMs) (Cai et al., 2024; Mu & Lin, 2025). This paradigm has led to the development of state-of-the-art MoE LLMs, including DeepSeek (Guo et al., 2025; Liu et al., 2024b), Qwen (Yang et al., 2025a), LongCat-Flash (Team et al., 2025) and others.
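The abstract characterizes SEs by rare but extreme activation outliers at the down_proj output of certain experts. As a rough illustration only, the following is a minimal sketch of how such experts could be flagged by ranking peak down_proj activations over a calibration set. It assumes a Hugging Face-style MoE model (e.g., Qwen3-30B-A3B) whose expert modules expose a `down_proj` linear layer and a dataloader yielding `input_ids`; the paper's actual automated profiling tool is not reproduced here and may use different criteria.

```python
# Hedged sketch: rank experts by the largest |activation| observed at their
# down_proj output. Module-name matching ("experts" / "down_proj") is an
# assumption about the model implementation, not taken from the paper.
import torch
from collections import defaultdict

def profile_super_expert_candidates(model, dataloader, device="cuda", top_k=10):
    peak = defaultdict(float)  # module name -> max abs activation seen
    hooks = []

    def make_hook(name):
        def hook(_module, _inputs, output):
            peak[name] = max(peak[name], output.detach().abs().max().item())
        return hook

    # Attach a forward hook to every expert down_proj we can find by name.
    for name, module in model.named_modules():
        if name.endswith("down_proj") and "expert" in name:
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            model(batch["input_ids"].to(device))

    for h in hooks:
        h.remove()

    # Experts with rare but extreme outliers are candidate Super Experts.
    return sorted(peak.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```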
Nov-13-2025
- Country:
  - Asia
    - China
      - Fujian Province > Xiamen (0.04)
      - Hong Kong (0.04)
      - Jiangsu Province > Nanjing (0.04)
    - Middle East > Jordan (0.04)
    - Myanmar > Tanintharyi Region > Dawei (0.04)
  - North America > United States (0.14)
- Genre:
- Research Report > New Finding (0.87)
- Technology: