MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving

Xue, Leyang; Fu, Yao; Lu, Zhan; Mai, Luo; Marina, Mahesh

Jan-25-2024–arXiv.org Artificial Intelligence

This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that realizes activation-aware expert offloading. MoE-Infinity features sequence-level expert activation tracing, a new approach adept at identifying sparse activations and capturing the temporal locality of MoE inference. By analyzing these traces, MoE-Infinity performs novel activation-aware expert prefetching and caching, substantially reducing the latency overhead usually associated with offloading experts and thereby improving cost performance. Extensive experiments in a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4-20X and decreasing deployment costs by over 8X for various MoEs. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity
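
To make the idea of activation-aware caching concrete, the following is a minimal Python sketch. It is not MoE-Infinity's implementation, which the abstract does not describe in detail. It assumes a toy setting: a fixed number of expert slots in GPU memory, a per-sequence count of how often each (layer, expert) pair has been activated, and an eviction rule that drops the resident expert with the fewest activations in the current sequence. The class name `ExpertCache` and its methods are hypothetical.

```python
from collections import Counter


class ExpertCache:
    """Toy fixed-capacity expert cache (hypothetical, for illustration only).

    Experts are identified by (layer, expert) pairs. The eviction rule is an
    assumption: evict the resident expert with the lowest activation count
    in the current sequence's trace.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()    # experts currently held in GPU memory
        self.counts = Counter()  # sequence-level trace: activations per expert
        self.hits = 0
        self.misses = 0

    def start_sequence(self):
        # The trace is kept per sequence, so reset it for each new request.
        self.counts.clear()

    def access(self, layer, expert):
        """Record an activation; return True on a cache hit, False on a miss."""
        key = (layer, expert)
        self.counts[key] += 1
        if key in self.resident:
            self.hits += 1
            return True
        # Miss: the expert would have to be fetched from host memory or disk.
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Evict the least-activated resident expert (ties broken by key).
            victim = min(self.resident, key=lambda k: (self.counts[k], k))
            self.resident.remove(victim)
        self.resident.add(key)
        return False
```

With a capacity of two and the activation sequence 1, 1, 2, 3, 1 on a single layer, expert 1 stays resident because it has been activated most often in the sequence, while expert 2 is evicted to make room for expert 3. A real system would also need prefetching ahead of the layer being executed, which this sketch leaves out.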

Genre:
- Research Report (1.00)

Technology:
- Information Technology
  - Hardware (0.72)
  - Artificial Intelligence
    - Natural Language
      - Large Language Model (0.93)
      - Chatbot (0.68)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)