FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Neural Information Processing Systems 

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory.
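The quadratic cost comes from the N x N score matrix that standard attention materializes in GPU high-bandwidth memory. As an illustration only (not code from the paper), the PyTorch sketch below contrasts that formulation with a tiled variant that streams key/value blocks and keeps running softmax statistics, so no N x N matrix is ever stored. The function names (standard_attention, blocked_attention) and the block size are assumptions made here; the paper's actual method is a fused GPU kernel that performs this kind of tiling in on-chip SRAM to cut reads and writes to high-bandwidth memory.

import torch

def standard_attention(q, k, v):
    # q, k, v: (N, d). Materializes the full N x N score matrix,
    # so memory grows quadratically in sequence length N.
    scores = (q @ k.T) * q.shape[-1] ** -0.5
    return torch.softmax(scores, dim=-1) @ v

def blocked_attention(q, k, v, block=128):
    # Streams K/V in tiles and maintains running softmax statistics
    # (an "online" softmax), so only O(N * d) memory is ever live.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(n, 1, dtype=q.dtype, device=q.device)
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                          # (N, block) partial scores
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)                      # softmax numerators for this tile
        rescale = torch.exp(row_max - new_max)          # re-scale previously accumulated terms
        row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum

With, say, q, k, v each torch.randn(1024, 64), torch.allclose(standard_attention(q, k, v), blocked_attention(q, k, v), atol=1e-5) holds: the tiled computation is exact attention, not an approximation, which is the point the title emphasizes.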
