CoSMoEs: Compact Sparse Mixture of Experts

Huber, Patrick, Shrivastava, Akshat, Chang, Ernie, Sankar, Chinnadhurai, Aly, Ahmed, Sagar, Adithya

Feb-28-2025–arXiv.org Artificial Intelligence

Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: quality, memory, and latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors), MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, which further improve MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.

Country:
- North America > United States
  - Virginia (0.04)
- Asia
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Middle East
    - Jordan (0.04)
    - Saudi Arabia > Asir Province
      - Abha (0.04)
  - Japan > Honshū
    - Chūbu > Aichi Prefecture > Nagoya (0.04)

Genre:
- Research Report (0.85)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
