AITopics | eviction

Collaborating Authors

eviction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

8f9d459c19b59b5400ce396e0f8c23e0-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 14:40:53 GMT

artificial intelligence, data mining, machine learning, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.46)

Add feedback

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Tian, Jiayi, Azizi, Seyedarmin, Zhao, Yequan, Potraghloo, Erfan Baghaei, McPherson, Sean, Sridhar, Sharath Nittur, Wang, Zhengyang, Zhang, Zheng, Pedram, Massoud, Kundu, Souvik

arXiv.org Artificial IntelligenceDec-10-2025

Large reasoning models (LRMs) often cost significant key-value (KV) cache overhead, due to their linear growth with the verbose chain-of-thought (CoT) reasoning process. This costs both memory and throughput bottleneck limiting their efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present \textbf{SkipKV}, a \textbf{\textit{training-free}} KV compression method for selective \textit{eviction} and \textit{generation} operating at a coarse-grained sentence-level sequence removal for efficient CoT reasoning. In specific, it introduces a \textit{sentence-scoring metric} to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference enforcing the LRM to generate concise response. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV in maintaining up to $\mathbf{26.7}\%$ improved accuracy compared to the alternatives, at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to $\mathbf{1.6}\times$ fewer generation length while improving throughput up to $\mathbf{1.7}\times$.

machine learning, natural language, skipkv, (14 more...)

arXiv.org Artificial Intelligence

2512.07993

Country: North America > United States > California (0.28)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.48)

Add feedback

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Bui, Ngoc, Sharma, Shubham, Lamba, Simran, Mishra, Saumitra, Ying, Rex

arXiv.org Artificial IntelligenceDec-4-2025

Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.

arxiv preprint arxiv, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2512.03324

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Wang, Yixuan, Ji, Shiyu, Liu, Yijun, Xu, Yuzhuang, Xu, Yang, Zhu, Qingfu, Che, Wanxiang

arXiv.org Artificial IntelligenceNov-18-2025

Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 $\sim$ 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.20334

Country:

Asia (0.46)
North America (0.28)
Europe > Austria > Vienna (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency

Chen, Peng, Zhao, Hailiang, Zhang, Jiaji, Tang, Xueyan, Wang, Yixuan, Deng, Shuiguang

arXiv.org Artificial IntelligenceNov-17-2025

The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce excessive computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_{k-1} + 2$, while preserving their $1$-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only O(1) additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2507.16242

Country:

North America > United States (0.28)
Asia (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.92)
Information Technology > Software (0.67)
(2 more...)

Add feedback

EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Kim, Minsoo, Kundu, Arnav, Kim, Han-Byul, Dixit, Richa, Cho, Minsik

arXiv.org Artificial IntelligenceOct-14-2025

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing memory bottleneck is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4-6x compression, and reduces latency/memory by up to 2.4x/3.5x, enabling efficient multi-turn interaction under strict resource limits. Our code is available at https://github.com/apple/ml-epicache.

large language model, machine learning, question answering, (21 more...)

arXiv.org Artificial Intelligence

2509.17396

Country:

Asia (0.92)
Europe > Austria (0.28)
North America > United States (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

8f9d459c19b59b5400ce396e0f8c23e0-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 09:28:57 GMT

algorithm, cache, imbalance, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > Texas > El Paso County > El Paso (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.46)

Add feedback

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Gu, Yuzhe, Liang, Xiyu, Zhao, Jiaojiao, Diao, Enmao

arXiv.org Artificial IntelligenceOct-10-2025

Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.07651

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Collaborating Authors

eviction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

0ea048312aa812b2711fe765a9e9ef05-Paper-Conference.pdf

8f9d459c19b59b5400ce396e0f8c23e0-Paper-Conference.pdf

0ea048312aa812b2711fe765a9e9ef05-Paper-Conference.pdf

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency

EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

8f9d459c19b59b5400ce396e0f8c23e0-Paper-Conference.pdf

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference