Scaling Context Requires Rethinking Attention
Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach
arXiv.org Artificial Intelligence
We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too high in the former and too low in the latter. Approaches such as sliding-window attention, which reduce a transformer's cost-per-token, impair in-context learning and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on in-context learning with power attention show that these models dominate both exponential attention and linear attention at long-context training.
Jul-8-2025
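The abstract's central claim, that a linear-cost attention layer can have its state size set independently of its parameter count, can be illustrated with a small sketch. Assuming power attention scores a query-key pair as (q·k)^p, the score factors through a degree-p tensor-power feature map, giving a causal recurrence whose state size d^p is controlled by p alone. The snippet below is an unnormalized, single-head illustration of that idea; the function names and the omission of normalization, gating, and the fused GPU kernels are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a power-attention-style recurrence (unnormalized, single head).
# Assumption: the score between query q and key k is (q . k)**p, which factors as
# <phi_p(q), phi_p(k)> with phi_p the degree-p tensor-power feature map. This
# yields a linear-cost recurrence whose state size d**p is set by p, independently
# of the layer's parameter count. Names are illustrative, not the paper's API.

import numpy as np

def tensor_power(x: np.ndarray, p: int) -> np.ndarray:
    """Degree-p tensor power of a vector, flattened to length d**p."""
    out = x
    for _ in range(p - 1):
        out = np.outer(out, x).reshape(-1)
    return out

def power_attention(Q, K, V, p=2):
    """Causal, unnormalized power attention computed via a running state.

    Q, K: (T, d) queries/keys; V: (T, d_v) values.
    The state S has shape (d**p, d_v): constant memory per step, linear total cost.
    """
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d ** p, d_v))
    Y = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(tensor_power(K[t], p), V[t])  # absorb token t into the state
        Y[t] = tensor_power(Q[t], p) @ S            # read out with the expanded query
    return Y

# Sanity check: matches quadratic attention with (q . k)**p scores and a causal mask.
rng = np.random.default_rng(0)
T, d, d_v, p = 6, 4, 3, 2
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d_v))
scores = np.tril((Q @ K.T) ** p)
assert np.allclose(scores @ V, power_attention(Q, K, V, p))
```

In this formulation the parameter count (the projections producing Q, K, V) is unchanged as p grows, while the recurrent state grows as d**p, which is the sense in which state size is adjustable independently of parameters.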