
Collaborating Authors

Yang, Lijie


TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

arXiv.org Artificial Intelligence

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1×.

Large language models (LLMs) have revolutionized natural language processing (NLP) by achieving state-of-the-art performance on various applications. As LLMs evolve, they are increasingly being adapted to manage tasks with long contexts, such as Chain-of-Thought reasoning (Wei et al., 2023), document summarization (Huang et al., 2021), and retrieval-augmented generation (Ram et al., 2023; Zhang et al., 2024b). However, quickly and efficiently serving long-context LLMs is challenging due to the inherent memory and compute bottlenecks in the Transformer architectures (Vaswani et al., 2023).
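To make the mechanism concrete, below is a minimal sketch of position persistent sparse attention in the spirit of TidalDecode, written in plain NumPy with toy single-head attention. The choice of token selection layers, the top-k budget, and all shapes are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of position persistent sparse attention (TidalDecode-style).
# Single-head attention, random toy data; layer indices and k are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])          # (num_tokens,)
    weights = softmax(scores)
    return weights @ V, scores

def decode_one_token(q_per_layer, K_cache, V_cache, selection_layers, k=8):
    """One decoding step through all layers.

    Layers in `selection_layers` perform full attention over the whole KV
    cache and refresh the set of selected token positions; every other layer
    performs sparse attention restricted to the persisted positions.
    """
    selected = None                                 # persisted token positions
    outputs = []
    for layer in range(len(q_per_layer)):
        q = q_per_layer[layer]
        K, V = K_cache[layer], V_cache[layer]
        if layer in selection_layers or selected is None:
            # Token selection layer: full attention, then keep the top-k positions.
            out, scores = attend(q, K, V)
            selected = np.argsort(scores)[-k:]
        else:
            # Sparse layer: attend only to the pre-selected positions.
            out, _ = attend(q, K[selected], V[selected])
        outputs.append(out)
    return outputs, selected

# Toy usage: 4 layers, 64 cached tokens, head dim 16; layers 0 and 2 re-select.
rng = np.random.default_rng(0)
L, T, d = 4, 64, 16
qs = [rng.normal(size=d) for _ in range(L)]
Ks = [rng.normal(size=(T, d)) for _ in range(L)]
Vs = [rng.normal(size=(T, d)) for _ in range(L)]
outs, picked = decode_one_token(qs, Ks, Vs, selection_layers={0, 2}, k=8)
print(len(outs), sorted(picked.tolist()))
```

The point of the sketch is that the top-k positions are computed only at the designated selection layers and reused by every following sparse layer, which is what removes the per-layer token selection overhead.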


Accelerating Retrieval-Augmented Language Model Serving with Speculation

arXiv.org Artificial Intelligence

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Among various RaLM approaches, iterative RaLM delivers better generation quality due to more frequent interaction between the retriever and the language model. Despite the benefits, iterative RaLM usually incurs high overheads due to the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec can achieve a speed-up ratio of 1.75-2.39×. For KNN-LM serving, RaLMSpec can achieve a speed-up ratio of up to 7.59× and 2.45× when the retriever is an exact dense retriever and an approximate dense retriever, respectively, compared with the baseline.

Recent advancements in large language models such as LLaMA-2, GPT-3, and PaLM have shown promising results in diverse NLP tasks (Touvron et al., 2023; Brown et al., 2020; Chowdhery et al., 2022). However, encoding a massive amount of knowledge into a fully parametric model requires excessive effort in both training and deployment. The situation can be further exacerbated when the foundation model is required to adapt to new data or various downstream tasks (Asai et al., 2023). To address this challenge, recent work introduces retrieval-augmented language models (RaLM), which integrate the parametric language model with a non-parametric knowledge base through retrieval augmentation (Khandelwal et al., 2019; Shi et al., 2023; Ram et al., 2023; Khattab et al., 2022).
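As a rough illustration of the speculation-then-verification loop, the sketch below uses a toy retriever, a tiny recency cache for speculative retrieval, and a fixed speculation stride. The cache policy, rollback rule, and all function names are simplifying assumptions for illustration, not RaLMSpec's actual implementation.

```python
# Minimal sketch of speculative retrieval with batched verification
# (RaLMSpec-style). The retriever, cache, and generator are toy stand-ins.
from collections import deque

def expensive_retrieve(query: str) -> str:
    """Stand-in for the real (slow) retriever; treated as ground truth."""
    return f"doc-for:{query.split()[-1]}"

class LocalCache:
    """Tiny recency cache used to *speculate* retrieval results."""
    def __init__(self, size: int = 4):
        self.entries = deque(maxlen=size)
    def speculate(self, query: str):
        return self.entries[-1] if self.entries else None
    def add(self, doc: str) -> None:
        self.entries.append(doc)

def generate_step(context: str, doc: str) -> str:
    """Stand-in for one language-model decoding step conditioned on a document."""
    return f"tok({len(context)}|{doc})"

def speculative_ralm(prompt: str, steps: int, stride: int = 3) -> list:
    cache, output, context, i = LocalCache(), [], prompt, 0
    while i < steps:
        # Speculation phase: run up to `stride` steps using cached documents.
        speculated = []
        for _ in range(min(stride, steps - i)):
            query = context
            doc = cache.speculate(query) or expensive_retrieve(query)
            tok = generate_step(context, doc)
            speculated.append((query, doc, tok))
            context += " " + tok
        # Verification phase: check all speculated retrievals (batched in a real system).
        verified = []
        for query, doc, tok in speculated:
            true_doc = expensive_retrieve(query)
            cache.add(true_doc)
            if doc == true_doc:
                verified.append(tok)
            else:
                # Mismatch: redo this step with the correct document, discard the rest.
                verified.append(generate_step(query, true_doc))
                break
        output.extend(verified)
        context = (prompt + " " + " ".join(output)).strip()
        i += len(verified)
    return output

print(speculative_ralm("what is retrieval augmentation", steps=6))
```

Because the verified outputs are identical to what the naive iterative loop would have produced, the speed-up comes purely from replacing per-step retrievals with cheap speculation plus a batched check.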


SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification

arXiv.org Artificial Intelligence

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement.

Existing LLM systems generally use an incremental decoding approach to serving a request, where the system computes the activations for all prompt tokens in a single step and then iteratively decodes one new token using the input prompt and all previously generated tokens. This approach is also called autoregressive decoding because each generated token is also used as input for generating future tokens. This dependency between tokens is crucial for many NLP tasks that require preserving the order and context of the generated tokens, such as text completion [53]. Incremental decoding respects data dependencies between tokens, but achieves suboptimal runtime performance and limited GPU utilization, since the degree of parallelism within each request is greatly limited in the incremental phase. In addition, the attention mechanism of the Transformer [46] requires accessing the keys and values of all previous tokens to compute the attention output of a new token.
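The sketch below illustrates the token tree idea with stand-in draft and target models: draft proposals are organized as a small tree, and the "LLM" accepts the longest root-to-leaf path whose tokens all match its own greedy choices. The acceptance rule and the toy models are simplifying assumptions; a real system such as SpecInfer scores all tree nodes in a single parallel pass against the LLM.

```python
# Minimal sketch of token tree verification (SpecInfer-style).
# Draft and target models are toy functions; the acceptance rule is simplified.
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    token: str
    children: list = field(default_factory=list)

VOCAB = ["the", "cat", "sat", "on", "mat"]

def llm_greedy_next(prefix: tuple) -> str:
    """Stand-in for the target LLM's next-token choice given a prefix."""
    return VOCAB[hash(prefix) % len(VOCAB)]

def build_token_tree(prefix: tuple, depth: int, width: int) -> list:
    """Stand-in draft models: propose `width` candidate tokens per position."""
    if depth == 0:
        return []
    roots = []
    for i in range(width):
        tok = VOCAB[(hash(prefix) + i) % len(VOCAB)]
        roots.append(TreeNode(tok, build_token_tree(prefix + (tok,), depth - 1, width)))
    return roots

def verify_token_tree(prefix: tuple, roots: list) -> list:
    """Accept the longest root-to-leaf path whose every token matches the
    target model's own choice; a real verifier checks all nodes in parallel."""
    accepted, nodes = [], roots
    while nodes:
        target = llm_greedy_next(prefix + tuple(accepted))
        match = next((n for n in nodes if n.token == target), None)
        if match is None:
            break
        accepted.append(match.token)
        nodes = match.children
    return accepted

prefix = ("the",)
tree = build_token_tree(prefix, depth=3, width=3)
print("accepted tokens:", verify_token_tree(prefix, tree))
```

Each accepted token in the tree saves one sequential decoding step of the target model, which is where the latency reduction comes from; rejected branches simply fall back to the LLM's own next token.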