SeerAttention: Self-distilled Attention Gating for Efficient Long-context Prefilling
Neural Information Processing Systems
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity in sequence length hinders efficiency and scalability, especially for long-context processing. A promising approach is to leverage sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics at the attention-head level, and they struggle to adapt efficiently to different contexts. We propose SeerAttention, a simple yet effective attention mechanism that learns block-level attention sparsity directly from the LLM itself.
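To make "block-level attention sparsity" concrete, here is a minimal sketch in plain Python of attention restricted by a block mask. It is an illustration under stated assumptions, not the SeerAttention implementation: the function names `block_sparse_attention` and `active_fraction` are hypothetical, the mask is written by hand rather than learned, and causal masking is omitted for brevity. How the mask is actually learned is not described in this abstract.

```python
import math


def block_sparse_attention(q, k, v, block_mask, block_size):
    """Softmax attention in which each query block only sees selected key blocks.

    q, k, v are lists of rows (seq_len x dim). block_mask[i][j] is True when
    query block i may attend to key block j. Hypothetical helper; the mask is
    supplied by the caller (assumption), not learned.
    """
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, q_row in enumerate(q):
        q_block = i // block_size
        # Keep only the key positions whose block is active for this query block.
        keys = [j for j in range(len(k)) if block_mask[q_block][j // block_size]]
        if not keys:
            raise ValueError("each query block needs at least one active key block")
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j])) for j in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, keys)) / total
            for c in range(len(v[0]))
        ])
    return out


def active_fraction(block_mask):
    """Fraction of (query block, key block) pairs that are actually computed."""
    cells = [cell for row in block_mask for cell in row]
    return sum(cells) / len(cells)
```

With a mask that activates only a fraction of the block pairs, the score computation shrinks by roughly that fraction relative to dense attention, which is where the savings for long-context prefilling would come from. A practical kernel would skip inactive blocks entirely on the GPU rather than filter positions in a Python loop.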
Jun-12-2026, 04:07:58 GMT