Huang, Yeqi
WaferLLM: A Wafer-Scale LLM Inference System
He, Congjie, Huang, Yeqi, Mu, Pei, Miao, Ziming, Xue, Jilong, Ma, Lingxiao, Yang, Fan, Mai, Luo
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared-memory architectures like GPUs, fail to fully exploit these accelerators. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves 200$\times$ better wafer-scale accelerator utilization than state-of-the-art systems. On a commodity wafer-scale accelerator, WaferLLM delivers 606$\times$ faster and 22$\times$ more energy-efficient GEMV compared to an advanced GPU. For LLMs using a 16-bit data type, WaferLLM achieves decode speeds of 2700 tokens/sec/request on Llama3-8B and 840 tokens/sec/request on Qwen2-72B, enabling 39$\times$ faster decoding with 1.7$\times$ better energy efficiency. We anticipate these numbers will grow significantly as wafer-scale AI models, software, and hardware continue to mature.
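To make the mesh-parallel GEMV idea concrete, the sketch below simulates one simple way a matrix-vector product can be partitioned over a P$\times$P mesh of cores: each core holds a weight tile and an input slice, and partial results are reduced along each mesh row. This is a minimal functional simulation under assumed tile shapes, not the MeshGEMV algorithm from the paper; the names (`mesh_gemv`, `P`, `tile`) are hypothetical.

```python
import numpy as np

def mesh_gemv(W, x, P):
    """Simulate y = W @ x on a P x P mesh of cores.

    Core (i, j) holds the weight tile W[i*mb:(i+1)*mb, j*nb:(j+1)*nb] and
    the input slice x[j*nb:(j+1)*nb]; partial products are then reduced
    along each mesh row. Purely a functional sketch, not a performance model.
    """
    M, N = W.shape
    assert M % P == 0 and N % P == 0, "tile shapes assumed to divide evenly"
    mb, nb = M // P, N // P

    y = np.zeros(M, dtype=W.dtype)
    for i in range(P):                           # mesh row
        row_partials = []
        for j in range(P):                       # mesh column
            tile = W[i*mb:(i+1)*mb, j*nb:(j+1)*nb]
            x_slice = x[j*nb:(j+1)*nb]
            row_partials.append(tile @ x_slice)  # local compute on core (i, j)
        # reduction along the mesh row (a ring/tree reduce on real hardware)
        y[i*mb:(i+1)*mb] = np.sum(row_partials, axis=0)
    return y

# Quick check against a dense GEMV
W = np.random.rand(8, 8).astype(np.float32)
x = np.random.rand(8).astype(np.float32)
assert np.allclose(mesh_gemv(W, x, P=4), W @ x, atol=1e-5)
```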
MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
Fu, Yao, Jiang, Yinsicheng, Huang, Yeqi, Nie, Ping, Lu, Zhan, Xue, Leyang, He, Congjie, Sit, Man-Kit, Xue, Jilong, Dong, Li, Miao, Ziming, Zou, Kai, Ponti, Edoardo, Mai, Luo
The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently; however, MoE systems rely on heterogeneous compute and memory resources. These factors collectively influence the system's Cost, Accuracy, and Performance (CAP), creating a challenging trade-off. Current benchmarks often fail to provide precise estimates of these effects, complicating practical considerations for deploying MoE systems. To bridge this gap, we introduce MoE-CAP, a benchmark specifically designed to evaluate MoE systems. Our findings highlight the difficulty of achieving an optimal balance of cost, accuracy, and performance with existing hardware capabilities. MoE systems often necessitate compromises on one factor to optimize the other two, a dynamic we term the MoE-CAP trade-off. To identify the best trade-off, we propose novel performance evaluation metrics - Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU) - and develop cost models that account for the heterogeneous compute and memory hardware integral to MoE systems. This benchmark is publicly available on HuggingFace: https://huggingface.co/spaces/sparse-generative-ai/open-moe-llm-leaderboard.
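The intuition behind a sparsity-aware utilization metric such as S-MBU is that a decode step only needs to read the parameters of the activated experts, so bandwidth utilization should be measured against that sparse traffic rather than the full model size. The sketch below is a simplified back-of-the-envelope version of this idea; the exact accounting in MoE-CAP may differ, and the function name, arguments, and example numbers are illustrative assumptions, not measurements.

```python
def sparse_mbu(active_params, bytes_per_param, tokens_per_sec, peak_bw_gbps):
    """Rough sparse memory-bandwidth utilization for memory-bound MoE decoding.

    active_params: parameters actually touched per token
                   (attention + shared layers + activated experts only).
    Returns the fraction of peak bandwidth consumed by that sparse traffic.
    """
    bytes_per_token = active_params * bytes_per_param
    achieved_gbps = bytes_per_token * tokens_per_sec / 1e9
    return achieved_gbps / peak_bw_gbps

# Illustrative numbers only: an MoE activating ~13B parameters per token in
# 16-bit precision, decoding at 50 tok/s on a GPU with 3350 GB/s peak bandwidth.
print(f"S-MBU ~ {sparse_mbu(13e9, 2, 50, 3350):.2%}")
```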
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
Fu, Yao, Xue, Leyang, Huang, Yeqi, Brabete, Andrei-Octavian, Ustiugov, Dmitrii, Patel, Yuvraj, Mai, Luo
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art serverless systems.
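As a rough illustration of the locality-aware allocation idea, the sketch below estimates per-server model startup time from where a checkpoint currently resides (DRAM, local SSD, or a remote store) plus any queueing delay, and picks the server with the lowest estimate. It is a simplified model with assumed bandwidth numbers and hypothetical names (`Server`, `pick_server`), not ServerlessLLM's actual scheduler.

```python
from dataclasses import dataclass

# Assumed effective checkpoint-loading bandwidths per tier (GB/s); illustrative only.
TIER_BANDWIDTH_GBPS = {"dram": 50.0, "ssd": 6.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # where this model's checkpoint sits on the server
    queue_delay_s: float   # time until a GPU frees up (e.g., after migration)

def startup_estimate(server: Server, model_size_gb: float) -> float:
    """Estimated time to first token: wait for a GPU, then load the checkpoint."""
    load_time = model_size_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + load_time

def pick_server(servers, model_size_gb):
    """Locality-aware choice: minimize estimated startup time across the cluster."""
    return min(servers, key=lambda s: startup_estimate(s, model_size_gb))

cluster = [
    Server("gpu-1", "remote", queue_delay_s=0.0),  # free, but must download
    Server("gpu-2", "dram", queue_delay_s=2.0),    # busy, but checkpoint is hot
]
print(pick_server(cluster, model_size_gb=15.0).name)  # -> gpu-2 (2.3s vs 15s)
```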