SmartCache: Context-aware Semantic Cache for Efficient Multi-turn LLM Inference
Neural Information Processing Systems
Large Language Models (LLMs) serving multi-turn conversations are inefficient: semantically similar queries from different user sessions trigger redundant computation and duplicate memory-intensive Key-Value (KV) caches. Existing optimizations such as prefix caching overlook semantic similarity, while typical semantic caches either ignore conversational context or are not integrated with low-level KV cache management. We propose SmartCache, a system-algorithm co-design framework that tackles this inefficiency by exploiting semantic query similarity across sessions. SmartCache uses a Semantic Forest structure to index conversational turns hierarchically, so that a response is retrieved and reused only when both the query and its conversational context match semantically. To maintain accuracy during topic shifts, it uses the LLM's internal attention scores, which are already computed during standard prefill, to detect context changes dynamically with minimal computational overhead.
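The context-aware lookup described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the names `TurnNode`, `SemanticForest`, `insert`, and `lookup`, the cosine-similarity threshold of 0.9, and the use of plain Python lists as embeddings are all assumptions made for illustration. The sketch only shows the idea that a cached response is returned when every earlier turn and the current query each match a stored turn along a single path; KV cache reuse and attention-based topic-shift detection are not modeled.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TurnNode:
    """One conversational turn (hypothetical layout, not from the paper)."""
    embedding: list[float]          # embedding of the user query at this turn
    response: str = ""              # cached response, empty if none is stored
    children: list[TurnNode] = field(default_factory=list)  # follow-up turns


class SemanticForest:
    """Toy index: each root is a first turn, each child a follow-up turn."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.roots: list[TurnNode] = []
        self.threshold = threshold  # assumed similarity cutoff

    def _best_match(self, nodes: list[TurnNode], query: list[float]) -> TurnNode | None:
        """Return the most similar node at or above the threshold, else None."""
        best, best_sim = None, self.threshold
        for node in nodes:
            sim = cosine(node.embedding, query)
            if sim >= best_sim:
                best, best_sim = node, sim
        return best

    def insert(self, turns: list[list[float]], response: str) -> None:
        """Cache `response` under the path formed by the turn embeddings."""
        if not turns:
            return
        nodes = self.roots
        for i, emb in enumerate(turns):
            node = self._best_match(nodes, emb)
            if node is None:
                node = TurnNode(emb)
                nodes.append(node)
            if i == len(turns) - 1:
                node.response = response
            nodes = node.children

    def lookup(self, turns: list[list[float]]) -> str | None:
        """Return a cached response only if context and query all match."""
        nodes = self.roots
        node = None
        for emb in turns:
            node = self._best_match(nodes, emb)
            if node is None:
                return None  # context or query diverged: cache miss
            nodes = node.children
        if node is None or not node.response:
            return None
        return node.response
```

In this sketch, a query that is semantically close to a cached one still misses if the preceding turns differ, because the walk fails at the first turn that has no sufficiently similar stored node.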
Neural Information Processing Systems
Jun-23-2026, 03:59:29 GMT
- Country:
- North America > United States (0.47)
- Asia > China (0.28)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Information Technology > Security & Privacy (0.68)
- Technology: