Enabling MoE on the Edge via Importance-Driven Expert Scheduling
Zhu, Guoying, Li, Meng, Dai, Haipeng, Liu, Xuechen, Wang, Weijun, Li, Keran, Xiao, Jun, Chen, Ligeng, Wang, Wei
arXiv.org Artificial Intelligence
Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, which makes dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions: low-importance active experts are substituted with functionally similar ones already cached in GPU memory, thereby preserving accuracy. This design reduces memory usage and data transfer while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Our extensive evaluations show that, compared with state-of-the-art approaches, our method achieves a 48% reduction in decoding latency and maintains an expert cache hit rate above 60%, all while preserving nearly lossless accuracy.

MoE architectures offer a promising approach for deploying Large Language Models (LLMs) on edge devices, addressing an increasingly critical need [31], [30], [22]. Yet edge servers are often limited in computational capacity and GPU memory, restricting full model deployment and rapid inference [32], [39]. Compared with dense models that compute all parameters for every input, MoE architectures mitigate these constraints by partitioning feed-forward layers into multiple experts [19] and activating only a sparse subset per token. This design can thus drastically reduce computation overhead. However, GPU memory limitations introduce a new bottleneck: experts must frequently be offloaded to CPU memory and repeatedly loaded back to the GPU, resulting in substantial inference latency.
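The substitution idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of normalized gate scores as the importance measure, the precomputed similarity table, and both thresholds are assumptions made only for illustration.

```python
from typing import Dict, Set


def schedule_experts(
    gate_scores: Dict[int, float],
    cached: Set[int],
    similarity: Dict[int, Dict[int, float]],
    importance_threshold: float = 0.2,   # hypothetical cutoff
    similarity_threshold: float = 0.8,   # hypothetical cutoff
) -> Dict[int, int]:
    """Map each router-selected expert to the expert that will actually run.

    Assumption: importance is the expert's share of the total gate score,
    and `similarity[a][b]` is a precomputed functional similarity in [0, 1].
    """
    total = sum(gate_scores.values())
    plan: Dict[int, int] = {}
    for expert, score in gate_scores.items():
        if expert in cached:
            # Cache hit: run the selected expert directly.
            plan[expert] = expert
            continue
        importance = score / total if total > 0 else 0.0
        candidates = similarity.get(expert, {})
        # Most similar expert that is already resident in GPU memory.
        best = max(
            (c for c in cached if c in candidates),
            key=lambda c: candidates[c],
            default=None,
        )
        if (
            importance < importance_threshold
            and best is not None
            and candidates[best] >= similarity_threshold
        ):
            # Low-importance miss: reuse a similar cached expert, no transfer.
            plan[expert] = best
        else:
            # Important (or dissimilar) miss: the expert must be loaded.
            plan[expert] = expert
    return plan


def experts_to_load(plan: Dict[int, int], cached: Set[int]) -> Set[int]:
    """Experts that still require a CPU-to-GPU transfer under the plan."""
    return {run for run in plan.values() if run not in cached}


# Illustrative data only; not taken from the paper.
gate_scores = {3: 0.6, 5: 0.3, 7: 0.1}
cached = {3, 6}
similarity = {5: {6: 0.5}, 7: {6: 0.9}}

plan = schedule_experts(gate_scores, cached, similarity)
print(plan)                           # {3: 3, 5: 5, 7: 6}
print(experts_to_load(plan, cached))  # {5}
```

In this toy case, expert 7 contributes little to the output and has a close cached neighbor, so it is replaced by expert 6 and only expert 5 is fetched over PCIe.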
Nov-20-2025
- Country:
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Genre:
- Research Report > Promising Solution (0.54)
- Industry:
- Information Technology (1.00)
- Technology: