FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Neural Information Processing Systems 

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications.
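To make the bottleneck concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) of standard scaled dot-product attention: the intermediate score matrix Q K^T is N x N, so time and memory grow quadratically with sequence length N, which is exactly what hurts long-context workloads.

```python
# Minimal sketch of standard attention, softmax(Q K^T / sqrt(d)) V.
# The (N, N) score matrix makes naive attention quadratic in sequence length.
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention over one head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N): quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (N, d) output

# Example: doubling N quadruples the size of the (N, N) score matrix.
N, d = 1024, 64
Q, K, V = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (1024, 64)
```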