SmartCache: Context-aware Semantic Cache for Efficient Multi-turn LLM Inference

Jun-23-2026, 03:59:29 GMT–Neural Information Processing Systems

Large Language Models (LLMs) serving multi-turn conversations are inefficient: semantically similar queries across different user sessions trigger redundant computation and duplicate memory-intensive Key-Value (KV) caches. Existing optimizations such as prefix caching overlook semantic similarity, while typical semantic caches either ignore conversational context or are not integrated with low-level KV cache management. We propose SmartCache, a system-algorithm co-design framework that tackles this inefficiency by exploiting semantic query similarity across sessions. SmartCache uses a Semantic Forest structure to index conversational turns hierarchically, enabling efficient retrieval and reuse of a response only when both the query semantics and the conversational context match. To maintain accuracy during topic shifts, it uses the LLM's internal attention scores, which are already computed during standard prefill, to detect context changes dynamically with minimal computational overhead.
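The dual-match condition described in the abstract (reuse a response only when both the query and the conversational context are similar) can be illustrated with a minimal sketch. This is not the paper's Semantic Forest or its attention-based shift detection: it is a flat list scanned linearly, and the names `ContextAwareCache`, `Entry`, `query_threshold`, and `context_threshold`, the 0.9 threshold values, and the toy two-dimensional vectors standing in for embeddings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    query_vec: list    # embedding of the cached query (hypothetical)
    context_vec: list  # embedding summarizing the preceding turns (hypothetical)
    response: str


@dataclass
class ContextAwareCache:
    # Threshold values are assumptions, not taken from the paper.
    query_threshold: float = 0.9
    context_threshold: float = 0.9
    entries: list = field(default_factory=list)

    def put(self, query_vec, context_vec, response):
        self.entries.append(Entry(query_vec, context_vec, response))

    def get(self, query_vec, context_vec):
        """Return a cached response only if BOTH query and context match."""
        best, best_score = None, -1.0
        for entry in self.entries:
            q_sim = cosine(query_vec, entry.query_vec)
            c_sim = cosine(context_vec, entry.context_vec)
            both_match = (q_sim >= self.query_threshold
                          and c_sim >= self.context_threshold)
            if both_match and q_sim > best_score:
                best, best_score = entry, q_sim
        return best.response if best else None


cache = ContextAwareCache()
cache.put([1.0, 0.0], [0.0, 1.0], "cached answer")

hit = cache.get([0.99, 0.05], [0.05, 0.99])   # similar query, similar context
miss = cache.get([1.0, 0.0], [1.0, 0.0])      # same query, different context
```

In this sketch `hit` is the cached response and `miss` is `None`, because an identical query asked under a different conversational context is treated as a cache miss. A hierarchical index such as the Semantic Forest would replace the linear scan in `get`.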

large language model, machine learning, natural language

Country:
- North America > United States (0.47)
- Asia > China (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Information Technology > Security & Privacy (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)
