LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks such as question answering, summarization, and reasoning (llm [a,b,c]). To enhance their reliability, LLMs are often augmented with domain-specific or user-specific knowledge that extends beyond their training data (Lewis et al. [2020], Jiang et al. [2023], Chen et al. [2024]). However, incorporating these supplemental contexts, which can exceed thousands of tokens (Jin et al. [2024], Gao et al. [2023]), presents two challenges: (1) models often struggle to comprehend long contexts (e.g., the lost-in-the-middle problem (Liu et al. [2023a], Junqing et al. [2023])), and (2) processing long contexts incurs substantial runtime costs (Liu et al. [2024], Lin et al. [2024], Zhong et al. [2024]). Since the Key-Value (KV) cache of the same context chunks is often reused across many requests (Liu et al. [2023b], Yao et al. [2024], Jin et al. [2024]), many recent systems adopt prefix caching (Jin et al. [2024], Liu et al. [2023b], Qin et al. [2024]), which stores the KV caches of frequently reused contexts so that LLMs no longer need to prefill them repeatedly. However, because the cached KV pairs remain unchanged, the model still loses track of key information in the context. So, is there a way to simultaneously achieve high efficiency and high quality without fine-tuning the model?
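To make the prefix-caching setup concrete, here is a minimal sketch (an illustration, not the paper's implementation) of reusing a precomputed KV cache via Hugging Face Transformers' `past_key_values`: the shared context is prefilled once, and each subsequent query only prefills its own tokens. The model choice ("gpt2") and the `answer` helper are placeholders for this sketch.

```python
# Minimal prefix-caching sketch: prefill a reused context once, cache its
# keys/values, and reuse that cache for every query over the same context.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder causal LM for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

context = "Reused document text that many queries share ..."
context_ids = tokenizer(context, return_tensors="pt").input_ids

# Prefill the shared context once and keep its KV cache.
with torch.no_grad():
    prefill = model(context_ids, use_cache=True)
shared_kv = prefill.past_key_values  # cached keys/values for the context

def answer(question: str) -> str:
    # Reuse a copy of the cached context KV pairs (generate mutates the cache).
    # Only the question tokens are prefilled here, which is the runtime saving.
    question_ids = tokenizer(question, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, question_ids], dim=1)
    with torch.no_grad():
        out = model.generate(
            input_ids,
            past_key_values=copy.deepcopy(shared_kv),
            max_new_tokens=32,
        )
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

print(answer("What does the shared document say?"))
```

The sketch also makes the quality limitation visible: the cached keys/values for the context are identical for every query, so any attention pattern that misses key information in the context is reproduced verbatim each time the cache is reused.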
Nov-21-2024