Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
–Neural Information Processing Systems
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been of great importance recently. However, most existing sparse attention algorithms use a fixed budget of how many tokens to use in their computations. This simple static decision raises critical issues in real-world deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we reveal a key insight that leveraging the idea of top-p sampling (a.k.a., nucleus sampling) in sparse attention could enable efficient and adaptive budget decisions. Based on this, we propose Twilight, a framework that enhances any existing sparse attention algorithm with adaptive budget decision capabilities without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% tokens with nearly no accuracy loss in both long-and medium-context scenarios, leading to a 1.4 speedup over state-of-the-art sparse attention mechanisms.
Neural Information Processing Systems
Jun-21-2026, 17:46:16 GMT
- Country:
- Asia (0.67)
- North America > United States
- California (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Research Report
- Technology: