
Collaborating Authors

 Cheng, Xinhao


A Multi-Level Superoptimizer for Tensor Programs

arXiv.org Artificial Intelligence

We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is μGraphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy. μGraphs enable Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To navigate the large search space, Mirage introduces a pruning technique based on abstraction. For a given algorithm, schedule-based optimizers automatically generate performant kernels by searching over possible strategies for executing the kernel on the target hardware. However, due to the linear algebra nature of DNNs, a tensor program can be represented by a wide spectrum of mathematically equivalent algorithms, and existing schedule-based optimizers only consider kernels whose algorithms are manually specified by users, resulting in missed optimization opportunities.
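
The idea can be sketched with a small, hypothetical Python/NumPy example (illustrative only; it is not Mirage's μGraph representation or API). Two mathematically equivalent formulations of attention, a naive one and a blocked online-softmax one, act as candidate algorithms, and a random-input comparison stands in for the superoptimizer's equivalence check over a much larger space of kernel-, thread-block-, and thread-level programs.

import numpy as np

def naive_attention(q, k, v):
    # Reference algorithm: materialize the full score matrix, then softmax.
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def blocked_attention(q, k, v, block=2):
    # Mathematically equivalent algorithm with a different schedule:
    # online softmax over key/value blocks, never materializing all scores.
    m = np.full(q.shape[0], -np.inf)          # running row-wise max
    l = np.zeros(q.shape[0])                  # running softmax normalizer
    acc = np.zeros((q.shape[0], v.shape[1]))  # running weighted value sum
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

def equivalent(f, g, trials=5, tol=1e-6):
    # Stand-in for a real equivalence check: compare candidates on random inputs.
    rng = np.random.default_rng(0)
    for _ in range(trials):
        q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
        if not np.allclose(f(q, k, v), g(q, k, v), atol=tol):
            return False
    return True

# Keep only candidates that agree with the reference; a superoptimizer would
# then pick the cheapest equivalent program for the target hardware.
candidates = {"naive": naive_attention, "blocked": blocked_attention}
valid = {name for name, fn in candidates.items() if equivalent(naive_attention, fn)}
print("equivalent candidates:", sorted(valid))

A real superoptimizer would generate the candidate programs automatically rather than by hand, and would use a much more rigorous equivalence check than the random testing shown here.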


FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

arXiv.org Artificial Intelligence

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.
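
The co-serving idea can be illustrated with a small, hypothetical Python sketch (the scheduler loop, token budget, and data structures below are illustrative and are not FlexLLM's actual interfaces): each iteration first gives every inference request one decode slot, then backfills any remaining slots in the token budget with token-level pieces of finetuning computation, so finetuning makes progress whenever inference leaves capacity idle.

from collections import deque
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    rid: int
    remaining_decode_steps: int   # one new token is generated per iteration

def co_serving_schedule(inference, finetune_tokens, token_budget=6):
    # Illustrative co-serving loop: inference decode tokens are placed first,
    # then leftover slots are backfilled with token-level finetuning work.
    finetune_queue = deque(finetune_tokens)
    iteration = 0
    while inference or finetune_queue:
        iteration += 1
        batch = []
        for req in list(inference):               # 1) latency-sensitive inference
            if len(batch) < token_budget:
                batch.append(("infer", req.rid))
                req.remaining_decode_steps -= 1
                if req.remaining_decode_steps == 0:
                    inference.remove(req)
        while finetune_queue and len(batch) < token_budget:
            batch.append(("finetune", finetune_queue.popleft()))  # 2) backfill
        print(f"iteration {iteration}: {batch}")

requests = [InferenceRequest(rid=0, remaining_decode_steps=3),
            InferenceRequest(rid=1, remaining_decode_steps=2)]
co_serving_schedule(requests, finetune_tokens=list(range(20)))

Under a heavy inference load the backfill shrinks, which matches the abstract's observation that inference latency is preserved while finetuning continues to make progress instead of stalling.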


Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

arXiv.org Artificial Intelligence

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.


SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification

arXiv.org Artificial Intelligence

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement of serving generative LLMs. Existing LLM systems generally use an incremental decoding approach to serving a request, where the system computes the activations for all prompt tokens in a single step and then iteratively decodes one new token using the input prompt and all previously generated tokens. This approach is also called autoregressive decoding because each generated token is also used as input for generating future tokens, a dependency that is crucial for many NLP tasks requiring the order and context of the generated tokens to be preserved, such as text completion [53]. Incremental decoding respects data dependencies between tokens but achieves suboptimal runtime performance and limited GPU utilization, since the degree of parallelism within each request is greatly limited in the incremental phase. In addition, the attention mechanism of the Transformer [46] requires accessing the keys and values of all previous tokens to compute the attention output of a new token.
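
Token tree verification can be illustrated with a small, hypothetical Python sketch (the TreeNode structure and the llm_next_token stand-in below are illustrative, not SpecInfer's implementation). Draft models speculate a tree of candidate continuations, and the verifier accepts the longest root-to-leaf chain whose tokens match what the LLM itself would emit; SpecInfer scores all tree nodes in a single tree-parallel decoding pass, whereas this sketch walks the tree recursively for clarity.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    token: str
    children: list = field(default_factory=list)

def llm_next_token(prefix):
    # Stand-in for an LLM forward pass: deterministically maps the last token
    # of the prefix to the next token (greedy decoding on a toy "model").
    vocab = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return vocab.get(prefix[-1], "<eos>")

def verify(prefix, node):
    # Returns the longest verified token sequence rooted at `node`.
    if node.token != llm_next_token(prefix):
        return []
    best = []
    for child in node.children:
        candidate = verify(prefix + [node.token], child)
        if len(candidate) > len(best):
            best = candidate
    return [node.token] + best

# Token tree speculated by small draft models for the prompt "the cat".
root_candidates = [
    TreeNode("sat", [TreeNode("on", [TreeNode("the")]), TreeNode("down")]),
    TreeNode("ran"),
]
prompt = ["the", "cat"]
accepted = max((verify(prompt, node) for node in root_candidates), key=len)
print("accepted tokens:", accepted)   # -> ['sat', 'on', 'the']

Because every token along the accepted path is committed in a single verification step, the number of sequential LLM decoding steps drops, which is where the latency savings of speculative inference come from.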