CompLLM: Compression for Long Context Q&A
Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
–arXiv.org Artificial Intelligence
While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.

LOFT is a long-context benchmark (128k tokens) designed to stress-test the long-context capabilities of frontier LLMs such as Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus. With CompLLM, we show that we can improve the long-context capabilities of much smaller open-source LLMs.

Figure 1: At high context lengths, CompLLM yields considerable speedups and improved results, without requiring any modification or tuning of the LLM, by efficiently reducing the number of embeddings fed to the LLM. The plot shows the Time To First Token (TTFT) with CompLLM and without it (i.e., with a standard pipeline) as a function of context length.
Among the many use cases of LLMs, one of the most popular is long-context Q&A: given a textual context of arbitrary length, the LLM should answer questions about it. Applications include coding assistants reading large codebases (Team, 2024), web agents reasoning on HTML pages (Zeng et al., 2024), users querying an LLM about a set of documents (Liu et al., 2024a), or RAG systems. Due to the quadratic complexity of the transformer (Vaswani et al., 2017), processing long contexts can be unfeasibly expensive; it is therefore important to reduce computational complexity, especially as contexts grow longer and longer.
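The segment-wise design described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `compress_segment` is a hypothetical stand-in for CompLLM's learned compressor (here it simply keeps every other token to mimic the 2x compression rate), and `SEGMENT_LEN` is an assumed training-time segment size. Because each segment is compressed independently, total compression cost grows linearly with context length, and a cache lets overlapping contexts across queries reuse already-compressed segments.

```python
from functools import lru_cache

SEGMENT_LEN = 1000  # assumed segment size, on the order of training-time sequences (e.g., 1k tokens)

def split_into_segments(tokens, seg_len=SEGMENT_LEN):
    """Split a token sequence into fixed-size segments (tuples, so they are hashable/cacheable)."""
    return [tuple(tokens[i:i + seg_len]) for i in range(0, len(tokens), seg_len)]

@lru_cache(maxsize=None)
def compress_segment(segment):
    """Placeholder for the learned compressor: keep every other token (2x rate).

    The real method maps a segment's token embeddings to half as many latent
    embeddings; only the caching/segmentation structure is illustrated here.
    """
    return segment[::2]

def compress_context(tokens):
    """Compress each segment independently; cost is linear in context length."""
    compressed = []
    for seg in split_into_segments(tokens):
        compressed.extend(compress_segment(seg))  # cache hit if seen in a prior query
    return compressed
```

With this structure, a second query whose context shares segments with a first query pays no compression cost for the shared segments, which is the reusability property the abstract highlights.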
Sep-24-2025