Huang, Yeqi
WaferLLM: A Wafer-Scale LLM Inference System
He, Congjie, Huang, Yeqi, Mu, Pei, Miao, Ziming, Xue, Jilong, Ma, Lingxiao, Yang, Fan, Mai, Luo
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared-memory architectures like GPUs, fail to fully exploit these accelerators. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves 200$\times$ better wafer-scale accelerator utilization than state-of-the-art systems. On a commodity wafer-scale accelerator, WaferLLM delivers 606$\times$ faster and 22$\times$ more energy-efficient GEMV compared to an advanced GPU. For LLMs using a 16-bit data type, WaferLLM achieves decode speeds of 2700 tokens/sec/request on Llama3-8B and 840 tokens/sec/request on Qwen2-72B, enabling 39$\times$ faster decoding with 1.7$\times$ better energy efficiency. We anticipate these numbers will grow significantly as wafer-scale AI models, software, and hardware continue to mature.
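To make the mesh-parallel GEMV idea concrete, the sketch below simulates one simple way a matrix-vector product can be partitioned over a P$\times$P mesh of cores: each core holds a weight tile and an input slice, and partial results are reduced along each mesh row. This is a minimal functional simulation under assumed tile shapes, not the MeshGEMV algorithm from the paper; the names (`mesh_gemv`, `P`, `tile`) are hypothetical.

```python
import numpy as np

def mesh_gemv(W, x, P):
    """Simulate y = W @ x on a P x P mesh of cores.

    Core (i, j) holds the weight tile W[i*mb:(i+1)*mb, j*nb:(j+1)*nb] and
    the input slice x[j*nb:(j+1)*nb]; partial products are then reduced
    along each mesh row. Purely a functional sketch, not a performance model.
    """
    M, N = W.shape
    assert M % P == 0 and N % P == 0, "tile shapes assumed to divide evenly"
    mb, nb = M // P, N // P

    y = np.zeros(M, dtype=W.dtype)
    for i in range(P):                           # mesh row
        row_partials = []
        for j in range(P):                       # mesh column
            tile = W[i*mb:(i+1)*mb, j*nb:(j+1)*nb]
            x_slice = x[j*nb:(j+1)*nb]
            row_partials.append(tile @ x_slice)  # local compute on core (i, j)
        # reduction along the mesh row (a ring/tree reduce on real hardware)
        y[i*mb:(i+1)*mb] = np.sum(row_partials, axis=0)
    return y

# Quick check against a dense GEMV
W = np.random.rand(8, 8).astype(np.float32)
x = np.random.rand(8).astype(np.float32)
assert np.allclose(mesh_gemv(W, x, P=4), W @ x, atol=1e-5)
```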
MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
Fu, Yao, Jiang, Yinsicheng, Huang, Yeqi, Nie, Ping, Lu, Zhan, Xue, Leyang, He, Congjie, Sit, Man-Kit, Xue, Jilong, Dong, Li, Miao, Ziming, Zou, Kai, Ponti, Edoardo, Mai, Luo
The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently; however, MoE systems rely on heterogeneous compute and memory resources. These factors collectively influence the system's Cost, Accuracy, and Performance (CAP), creating a challenging trade-off. Current benchmarks often fail to provide precise estimates of these effects, complicating practical considerations for deploying MoE systems. To bridge this gap, we introduce MoE-CAP, a benchmark specifically designed to evaluate MoE systems. Our findings highlight the difficulty of achieving an optimal balance of cost, accuracy, and performance with existing hardware capabilities. MoE systems often necessitate compromises on one factor to optimize the other two, a dynamic we term the MoE-CAP trade-off. To identify the best trade-off, we propose novel performance evaluation metrics - Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU) - and develop cost models that account for the heterogeneous compute and memory hardware integral to MoE systems. This benchmark is publicly available on HuggingFace: https://huggingface.co/spaces/sparse-generative-ai/open-moe-llm-leaderboard.
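The intuition behind a sparsity-aware utilization metric such as S-MBU is that a decode step only needs to read the parameters of the activated experts, so bandwidth utilization should be measured against that sparse traffic rather than the full model size. The sketch below is a simplified back-of-the-envelope version of this idea; the exact accounting in MoE-CAP may differ, and the function name, arguments, and example numbers are illustrative assumptions, not measurements.

```python
def sparse_mbu(active_params, bytes_per_param, tokens_per_sec, peak_bw_gbps):
    """Rough sparse memory-bandwidth utilization for memory-bound MoE decoding.

    active_params: parameters actually touched per token
                   (attention + shared layers + activated experts only).
    Returns the fraction of peak bandwidth consumed by that sparse traffic.
    """
    bytes_per_token = active_params * bytes_per_param
    achieved_gbps = bytes_per_token * tokens_per_sec / 1e9
    return achieved_gbps / peak_bw_gbps

# Illustrative numbers only: an MoE activating ~13B parameters per token in
# 16-bit precision, decoding at 50 tok/s on a GPU with 3350 GB/s peak bandwidth.
print(f"S-MBU ~ {sparse_mbu(13e9, 2, 50, 3350):.2%}")
```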
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
Fu, Yao, Xue, Leyang, Huang, Yeqi, Brabete, Andrei-Octavian, Ustiugov, Dmitrii, Patel, Yuvraj, Mai, Luo
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art serverless systems.
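As a rough illustration of the locality-aware allocation idea, the sketch below estimates per-server model startup time from where a checkpoint currently resides (DRAM, local SSD, or a remote store) plus any queueing delay, and picks the server with the lowest estimate. It is a simplified model with assumed bandwidth numbers and hypothetical names (`Server`, `pick_server`), not ServerlessLLM's actual scheduler.

```python
from dataclasses import dataclass

# Assumed effective checkpoint-loading bandwidths per tier (GB/s); illustrative only.
TIER_BANDWIDTH_GBPS = {"dram": 50.0, "ssd": 6.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # where this model's checkpoint sits on the server
    queue_delay_s: float   # time until a GPU frees up (e.g., after migration)

def startup_estimate(server: Server, model_size_gb: float) -> float:
    """Estimated time to first token: wait for a GPU, then load the checkpoint."""
    load_time = model_size_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + load_time

def pick_server(servers, model_size_gb):
    """Locality-aware choice: minimize estimated startup time across the cluster."""
    return min(servers, key=lambda s: startup_estimate(s, model_size_gb))

cluster = [
    Server("gpu-1", "remote", queue_delay_s=0.0),  # free, but must download
    Server("gpu-2", "dram", queue_delay_s=2.0),    # busy, but checkpoint is hot
]
print(pick_server(cluster, model_size_gb=15.0).name)  # -> gpu-2 (2.3s vs 15s)
```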