Accelerating Retrieval-Augmented Language Model Serving with Speculation

Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia


Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Among various RaLM approaches, iterative RaLM delivers better generation quality due to more frequent interaction between the retriever and the language model. Despite these benefits, iterative RaLM usually incurs high overheads due to the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec can achieve a speed-up ratio of 1.75-2.39x. For KNN-LM serving, RaLMSpec can achieve a speed-up ratio of up to 7.59x and 2.45x when the retriever is an exact dense retriever and an approximate dense retriever, respectively, compared with the baseline.

Recent advancements in large language models such as LLaMA-2, GPT-3, and PaLM have shown promising results in diverse NLP tasks (Touvron et al., 2023; Brown et al., 2020; Chowdhery et al., 2022). However, encoding a massive amount of knowledge into a fully parametric model requires excessive effort in both training and deployment. The situation can be further exacerbated when the foundation model is required to adapt to new data or various downstream tasks (Asai et al., 2023). To address this challenge, recent work introduces retrieval-augmented language models (RaLM), which integrate a parametric language model with a non-parametric knowledge base through retrieval augmentation (Khandelwal et al., 2019; Shi et al., 2023; Ram et al., 2023; Khattab et al., 2022).
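To make the speculative serving loop concrete, the following is a minimal sketch of speculative retrieval with batched verification as the abstract describes it: each decoding step retrieves from a cheap local cache, and after a fixed speculation stride the speculated documents are checked against the true retriever in a single batched call, with rollback on the first mismatch so the output matches non-speculative serving. All names below (ToyRetriever, ToyCache, generate_one, stride) are illustrative assumptions, not the authors' implementation or API.

```python
# Illustrative sketch of speculative retrieval + batched verification.
# Not the RaLMSpec codebase; a toy model of the idea under assumed interfaces.

class ToyRetriever:
    """Stands in for an expensive (exact or approximate) dense retriever."""
    def __init__(self, kb):
        self.kb = kb

    def retrieve(self, query):
        return self.kb[hash(query) % len(self.kb)]

    def batch_retrieve(self, queries):
        # A single batched call amortizes per-request retrieval overhead.
        return [self.retrieve(q) for q in queries]


class ToyCache:
    """Cheap local cache used as the speculative retriever."""
    def __init__(self, default_doc):
        self.store = {}
        self.default_doc = default_doc

    def retrieve(self, query):
        return self.store.get(query, self.default_doc)  # possibly stale guess

    def update(self, query, doc):
        self.store[query] = doc


def generate_one(tokens, doc):
    # Placeholder for one retrieval-conditioned decoding step of the LM.
    return f"tok{len(tokens)}<{doc}>"


def speculative_step(retriever, cache, tokens, stride):
    """Decode `stride` tokens against cached retrievals, then verify all
    speculated documents with one batched retrieval; on the first mismatch,
    roll back and regenerate with the correct document, preserving the exact
    output of non-speculative iterative RaLM serving."""
    queries, spec_docs, new_tokens = [], [], []
    for _ in range(stride):
        query = " ".join(tokens + new_tokens)     # stand-in query encoder
        doc = cache.retrieve(query)               # speculative, cheap
        queries.append(query)
        spec_docs.append(doc)
        new_tokens.append(generate_one(tokens + new_tokens, doc))

    true_docs = retriever.batch_retrieve(queries) # batched verification
    for i, (spec, true) in enumerate(zip(spec_docs, true_docs)):
        cache.update(queries[i], true)
        if spec != true:                          # mis-speculation: roll back
            corrected = generate_one(tokens + new_tokens[:i], true)
            return tokens + new_tokens[:i] + [corrected]
    return tokens + new_tokens                    # every speculation verified


kb = ["doc_a", "doc_b", "doc_c"]
retriever, cache = ToyRetriever(kb), ToyCache(kb[0])
tokens = ["<bos>"]
for _ in range(3):
    tokens = speculative_step(retriever, cache, tokens, stride=4)
print(tokens)
```

The key property the sketch preserves is correctness: every speculated document is eventually checked against the true retriever, so speculation changes only latency, never the generated tokens. Prefetching, the stride scheduler, and asynchronous verification described in the abstract would layer on top of this basic loop.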