Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling.

The recent evolution of large language models (LLMs) has been driven by empirical scaling laws (Hestness et al., 2017) that link training loss to model size, dataset size, and compute budget. Kaplan et al. (2020) showed that these laws hold across seven orders of magnitude, establishing them as a reliable extrapolation tool for dense Transformers. Subsequent work by Hoffmann et al. (2022) demonstrated that scaling curves can be inverted to choose the compute-optimal combination of parameters and tokens for a fixed budget. Together, these results have made scaling analysis a cornerstone of model planning at both academic and industrial labs.
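To make the quantities in the abstract concrete, the sketch below shows a back-of-the-envelope calculation of total parameters, active parameters, tokens per parameter (TPP), and approximate training FLOPs for a top-k routed MoE. This is an illustrative example, not the paper's methodology: the configuration numbers are hypothetical, the parameter-counting convention is simplified (one pooled expert budget rather than per-layer routing), and the FLOPs estimate uses the common ~6 x N_active per-token approximation.

```python
# Illustrative sketch (not from the paper): simplified accounting for active
# parameters, active FLOPs per token, and tokens-per-parameter (TPP) in a
# top-k routed MoE. All configuration numbers are hypothetical.

def moe_active_params(dense_params: float, expert_params_each: float,
                      num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total_params, active_params) for a simplified MoE.

    dense_params:       parameters outside the expert blocks (attention, embeddings, ...)
    expert_params_each: parameters in a single expert FFN (pooled over layers here)
    num_experts:        total number of experts
    top_k:              experts activated per token
    """
    total = dense_params + num_experts * expert_params_each
    active = dense_params + top_k * expert_params_each
    return total, active


def tokens_per_parameter(train_tokens: float, total_params: float) -> float:
    """TPP: training tokens divided by total parameters."""
    return train_tokens / total_params


if __name__ == "__main__":
    # Hypothetical configuration: 1B dense params, 64 experts of 0.5B each,
    # top-2 routing, trained on 1T tokens under a fixed compute budget.
    total, active = moe_active_params(dense_params=1e9,
                                      expert_params_each=0.5e9,
                                      num_experts=64, top_k=2)
    train_tokens = 1e12
    flops_per_token = 6 * active                 # rough forward+backward estimate
    train_flops = flops_per_token * train_tokens
    print(f"total params:                {total:.2e}")
    print(f"active params:               {active:.2e}")
    print(f"TPP (tokens / total params): {tokens_per_parameter(train_tokens, total):.1f}")
    print(f"approx. training FLOPs:      {train_flops:.2e}")
```

Under these assumptions, increasing top-k raises active parameters (and hence active FLOPs) without changing the total parameter count or TPP, which is the sparsity dimension the paper argues must be chosen jointly with TPP.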
arXiv.org Artificial Intelligence
Sep-26-2025