FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

May-27-2025, 06:33:48 GMT–Neural Information Processing Systems

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting the asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp specialization and (2) interleave block-wise matmul and softmax operations, and (3) using block quantization and incoherent processing that leverage hardware support for FP8 low precision. We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0× speedup on H100 GPUs. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.
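To make the "block-wise matmul and softmax" wording concrete, here is a minimal pure-Python sketch of attention computed one key/value block at a time with a running (online) softmax, the general idea the FlashAttention family builds on. It is not the Hopper kernel from the paper: it shows none of the asynchrony, warp specialization, TMA usage, or FP8 handling, and the function names and `block_size` parameter are hypothetical, chosen only for illustration.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing each full score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def blockwise_attention(Q, K, V, block_size=4):
    """Same result, but K/V are visited in blocks and the softmax is rescaled online.

    Illustrative only: `block_size` is an assumed parameter, not a value from the paper.
    """
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized output, relative to running_max
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            # "matmul" step: scores of this query against one key block
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
            # "softmax" step: rescale earlier partial results to the new maximum
            new_max = max(running_max, max(scores))
            scale = math.exp(running_max - new_max)  # 0.0 on the first block
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q = random_matrix(5, 8, rng)
K = random_matrix(10, 8, rng)   # 10 keys: not a multiple of the block size
V = random_matrix(10, 6, rng)
print("max abs difference:", max_abs_diff(naive_attention(Q, K, V),
                                          blockwise_attention(Q, K, V)))
```

Because each block only needs the running maximum, running sum, and accumulator from earlier blocks, the score computation for one block and the softmax rescaling for another are separate stages of work; the abstract's point is that on Hopper hardware such stages can be interleaved rather than run strictly in sequence.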

large language model, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.83)
  - Natural Language > Large Language Model (0.64)