Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Xie, Roy; Wang, Junlin; Rosu, Paul; Deng, Chunyuan; Sun, Bolun; Lin, Zihao; Dhingra, Bhuwan

Feb-2-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method that enables LLMs to self-terminate processing once they have acquired sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals", detectable through lightweight classifiers, that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding, rather than external compression heuristics, naturally dictates processing needs. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate a 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method outperforms other context efficiency methods at the same rate of token reduction. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
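The control flow the abstract describes (read the context incrementally and stop once a sufficiency signal fires) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the chunk size, the threshold, and the keyword-based `toy_score` are all assumptions made for illustration. In the paper the signal comes from lightweight classifiers over attention-head activations, which the toy scorer below only stands in for.

```python
from typing import Callable, List, Sequence


def dynamic_cutoff(
    tokens: Sequence[str],
    sufficiency_score: Callable[[Sequence[str]], float],
    chunk_size: int = 4,
    threshold: float = 0.5,
) -> List[str]:
    """Return the shortest chunk-aligned prefix judged sufficient.

    `sufficiency_score` is a hypothetical stand-in for a probe that maps
    the context read so far to a score in [0, 1].
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Grow the prefix one chunk at a time; the final step is clipped
    # to the full context length.
    for end in range(chunk_size, len(tokens) + chunk_size, chunk_size):
        prefix = tokens[: min(end, len(tokens))]
        if sufficiency_score(prefix) >= threshold:
            return list(prefix)
    # The signal never fired: fall back to the full context.
    return list(tokens)


def toy_score(prefix: Sequence[str]) -> float:
    """Toy scorer (assumption): sufficient once the answer token is seen."""
    return 1.0 if "Paris" in prefix else 0.0


context = (
    "The capital of France is Paris . It has many museums and cafes"
).split()
kept = dynamic_cutoff(context, toy_score)
# Ratio of original to retained tokens, one way to express token reduction.
reduction = len(context) / len(kept)
print(len(context), len(kept), reduction)
```

In this toy run the cutoff happens at a chunk boundary rather than at the exact answer token, so the chunk size trades off how often the scorer is called against how many extra tokens are read after the information becomes sufficient.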

large language model, machine learning, natural language

Country:
- Asia (0.46)
- North America > United States
  - California (0.14)
  - Texas (0.14)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.95)
  - Natural Language > Large Language Model (1.00)