Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling.

The recent evolution of large language models (LLMs) has been driven by empirical scaling laws (Hestness et al., 2017) that link training loss to model size, dataset size, and compute budget. Kaplan et al. (2020) showed that these laws hold across seven orders of magnitude, establishing them as a reliable extrapolation tool for dense Transformers. Subsequent work by Hoffmann et al. (2022) demonstrated that scaling curves can be inverted to choose the compute-optimal combination of parameters and tokens for a fixed budget. Together, these results have made scaling analysis a cornerstone of model planning at both academic and industrial labs.
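To make the quantities in the abstract concrete, the sketch below shows a back-of-the-envelope calculation of total parameters, active parameters, tokens per parameter (TPP), and approximate training FLOPs for a top-k routed MoE. This is an illustrative example, not the paper's methodology: the configuration numbers are hypothetical, the parameter-counting convention is simplified (one pooled expert budget rather than per-layer routing), and the FLOPs estimate uses the common ~6 x N_active per-token approximation.

```python
# Illustrative sketch (not from the paper): simplified accounting for active
# parameters, active FLOPs per token, and tokens-per-parameter (TPP) in a
# top-k routed MoE. All configuration numbers are hypothetical.

def moe_active_params(dense_params: float, expert_params_each: float,
                      num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total_params, active_params) for a simplified MoE.

    dense_params:       parameters outside the expert blocks (attention, embeddings, ...)
    expert_params_each: parameters in a single expert FFN (pooled over layers here)
    num_experts:        total number of experts
    top_k:              experts activated per token
    """
    total = dense_params + num_experts * expert_params_each
    active = dense_params + top_k * expert_params_each
    return total, active


def tokens_per_parameter(train_tokens: float, total_params: float) -> float:
    """TPP: training tokens divided by total parameters."""
    return train_tokens / total_params


if __name__ == "__main__":
    # Hypothetical configuration: 1B dense params, 64 experts of 0.5B each,
    # top-2 routing, trained on 1T tokens under a fixed compute budget.
    total, active = moe_active_params(dense_params=1e9,
                                      expert_params_each=0.5e9,
                                      num_experts=64, top_k=2)
    train_tokens = 1e12
    flops_per_token = 6 * active                 # rough forward+backward estimate
    train_flops = flops_per_token * train_tokens
    print(f"total params:                {total:.2e}")
    print(f"active params:               {active:.2e}")
    print(f"TPP (tokens / total params): {tokens_per_parameter(train_tokens, total):.1f}")
    print(f"approx. training FLOPs:      {train_flops:.2e}")
```

Under these assumptions, increasing top-k raises active parameters (and hence active FLOPs) without changing the total parameter count or TPP, which is the sparsity dimension the paper argues must be chosen jointly with TPP.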
arXiv.org Artificial Intelligence
Sep-26-2025