Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning

Jun-21-2026, 17:46:16 GMT–Neural Information Processing Systems

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has become increasingly important. However, most existing sparse attention algorithms use a fixed budget for the number of tokens included in their computations. This static decision raises critical issues in real-world deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we reveal a key insight: applying the idea of top-p sampling (a.k.a. nucleus sampling) to sparse attention enables efficient and adaptive budget decisions. Based on this, we propose Twilight, a framework that enhances any existing sparse attention algorithm with adaptive budget decision capabilities without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% of tokens with nearly no accuracy loss in both long- and medium-context scenarios, leading to a 1.4× speedup over state-of-the-art sparse attention mechanisms.
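To make the contrast between a fixed token budget and a top-p budget concrete, the following is a minimal sketch of nucleus-style selection over attention weights for a single query. It is not the Twilight implementation: the function names `softmax` and `top_p_select`, the threshold 0.9, and the example logits are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(weights, p):
    """Return indices of the smallest set of tokens whose attention
    weights sum to at least p, ordered from highest weight to lowest.
    (Hypothetical helper; illustrates top-p selection only.)"""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return kept

# Made-up attention logits for one query over 8 cached tokens.
peaked = softmax([8.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1])  # one dominant token
flat = softmax([1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9])    # nearly uniform

# The same threshold p yields a different budget for each distribution.
print(len(top_p_select(peaked, 0.9)), "tokens kept for the peaked head")
print(len(top_p_select(flat, 0.9)), "tokens kept for the flat head")
```

With a fixed budget of, say, 4 tokens, the peaked case would compute attention over tokens that carry almost no weight, while the flat case would discard roughly half of the attention mass. A cumulative-weight threshold adjusts the budget to each distribution instead.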

large language model, machine learning, natural language

Country:
- Asia (0.67)
- North America > United States
  - California (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
