Enabling MoE on the Edge via Importance-Driven Expert Scheduling

Zhu, Guoying; Li, Meng; Dai, Haipeng; Liu, Xuechen; Wang, Weijun; Li, Keran; Xiao, Jun; Chen, Ligeng; Wang, Wei

arXiv.org Artificial Intelligence 

Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance active experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. This design reduces memory usage and data transfer while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Our extensive evaluations show that, compared with state-of-the-art approaches, our method achieves a 48% reduction in decoding latency and maintains an expert cache hit rate above 60%, all while preserving nearly lossless accuracy.

MoE architectures offer a promising approach for deploying Large Language Models (LLMs) on edge devices, addressing an increasingly critical need [31], [30], [22]. Yet edge servers are often limited in computational capacity and GPU memory, restricting full model deployment and rapid inference [32], [39]. Compared with dense models that compute all parameters for every input, MoE architectures mitigate these constraints by partitioning feed-forward layers into multiple experts [19] and activating only a sparse subset per token. This design can drastically reduce computation overhead. However, GPU memory limitations introduce a new bottleneck: experts must frequently be offloaded to CPU memory and repeatedly loaded back to the GPU, resulting in substantial inference latency.
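
The substitution idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `schedule_experts`, the use of normalized gate weights as the importance measure, the precomputed `similarity` table, and both thresholds are assumptions made here for illustration only. The sketch keeps routed experts that are already cached, replaces a low-importance uncached expert with a sufficiently similar cached one, and otherwise marks the expert for loading from CPU memory.

```python
from typing import Dict, List, Set, Tuple


def schedule_experts(
    gate_weights: Dict[int, float],
    cached: Set[int],
    similarity: Dict[Tuple[int, int], float],
    importance_threshold: float = 0.2,   # hypothetical value
    similarity_threshold: float = 0.8,   # hypothetical value
) -> Tuple[Dict[int, int], List[int]]:
    """Return (assignment, to_load) for one token at one MoE layer.

    assignment maps each routed expert to the expert that will actually run.
    to_load lists routed experts that must be fetched from CPU memory.
    """
    total = sum(gate_weights.values())
    if total <= 0:
        raise ValueError("gate weights must sum to a positive value")

    assignment: Dict[int, int] = {}
    to_load: List[int] = []
    # Experts that are routed or already chosen as a substitute;
    # a cached expert is never used twice for the same token.
    in_use = set(gate_weights)

    # Visit routed experts from most to least important.
    for expert, weight in sorted(gate_weights.items(), key=lambda kv: -kv[1]):
        if expert in cached:
            assignment[expert] = expert  # cache hit, no transfer needed
            continue

        # Assumption: importance is the expert's share of the gate mass.
        importance = weight / total
        candidates = [c for c in sorted(cached) if c not in in_use]
        best = max(
            candidates,
            key=lambda c: similarity.get((expert, c), 0.0),
            default=None,
        )
        best_sim = similarity.get((expert, best), 0.0) if best is not None else 0.0

        if importance < importance_threshold and best_sim >= similarity_threshold:
            assignment[expert] = best    # substitute, avoiding a PCIe transfer
            in_use.add(best)
        else:
            assignment[expert] = expert  # too important or no close match
            to_load.append(expert)

    return assignment, to_load


# Example: experts 0, 3, 5 are routed; experts 0, 1, 2 are cached on the GPU.
assignment, to_load = schedule_experts(
    gate_weights={0: 0.6, 3: 0.3, 5: 0.1},
    cached={0, 1, 2},
    similarity={(5, 1): 0.9, (5, 2): 0.5, (3, 2): 0.95},
)
```

In this example, expert 0 is a cache hit, expert 3 is loaded because its importance exceeds the threshold despite having a similar cached expert, and the low-importance expert 5 is served by cached expert 1. How importance and similarity are actually measured in the paper is not stated in the abstract.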
