FlashAttention-3
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Dong, Yanhao, Miao, Yubo, Li, Weinan, Zheng, Xiao, Wang, Chao, Wu, Jiesheng, Lyu, Feng
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method that breaks through the memory-bandwidth bottleneck of LLM inference by overlapping computation with data loading. By scheduling otherwise idle memory bandwidth during active computation windows, our method proactively prefetches the required KV Cache into the GPU L2 cache, so that subsequent accesses hit in L2 at high speed and HBM access latency is effectively hidden within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves a 2.15x improvement in attention kernel efficiency and up to 1.97x end-to-end throughput enhancement, surpassing the state-of-the-art FlashAttention-3 baseline. Notably, our solution remains orthogonal to existing optimization techniques and can be integrated into current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.
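The abstract does not show the prefetch mechanism itself, but the core primitive, pulling KV pages into L2 from a side stream while the compute stream is busy, can be sketched in a few lines of CUDA. The kernel below is a minimal illustration under assumed names (prefetch_kv_to_l2 and a hypothetical device-side table of KV page pointers), not the authors' implementation; it relies on the standard PTX prefetch.global.L2 instruction and a separate stream for the compute-load overlap.

    #include <cuda_runtime.h>

    // Hypothetical paged-KV layout: 'blocks' is a device array of pointers to
    // the KV pages that the *next* attention step will read. Each thread
    // issues non-blocking PTX prefetches that pull lines into L2 only.
    __global__ void prefetch_kv_to_l2(const void* const* blocks, int num_blocks,
                                      size_t block_bytes) {
        const size_t line = 128;  // one L2 cache line per prefetch
        for (int b = blockIdx.x; b < num_blocks; b += gridDim.x) {
            const char* base = static_cast<const char*>(blocks[b]);
            for (size_t off = (size_t)threadIdx.x * line; off < block_bytes;
                 off += (size_t)blockDim.x * line) {
                asm volatile("prefetch.global.L2 [%0];" ::"l"(base + off));
            }
        }
    }

    // Host side: issue the prefetch on a separate stream so it overlaps the
    // compute stream's current-layer kernels; ideally, by the time the next
    // attention kernel runs, its KV pages are already resident in L2.
    void schedule_prefetch(const void* const* d_block_ptrs, int num_blocks,
                           size_t block_bytes, cudaStream_t prefetch_stream) {
        prefetch_kv_to_l2<<<32, 256, 0, prefetch_stream>>>(
            d_block_ptrs, num_blocks, block_bytes);
    }

Whether the prefetched data survives until the attention kernel reads it depends on L2 capacity and eviction pressure; the paper's contribution is precisely the scheduling of these prefetches into the right computation windows.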
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Dege, Pengcuo, Luo, Qiuming, Mao, Rui, Kong, Chang
Deploying the DeepSeek-R1 671B model on a single multi-GPU server makes efficient inference of Multi-Head Latent Attention (MLA) challenging. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition so that the KV context length aligns with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computation. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks such as FlashAttention-3 and FlashInfer, and is supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
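To make the transposition concrete, here is the identity as I read it from the abstract, in my own notation (\(q\) = query rows in a decode step, \(n\) = KV context length, \(d\) = head dimension); the paper's exact tiling is not reproduced here. Hopper WGMMA instructions fix the \(M\)-tile at 64 rows, so the standard orientation leaves most of each tile idle when \(q \ll 64\) during decoding, while the transposed form places the long \(n\) dimension on \(M\):

\[
S = \frac{QK^\top}{\sqrt{d}} \in \mathbb{R}^{q \times n}, \quad O = \mathrm{softmax}(S)\,V
\qquad\Longleftrightarrow\qquad
S^\top = \frac{KQ^\top}{\sqrt{d}} \in \mathbb{R}^{n \times q}, \quad O^\top = V^\top\,\mathrm{softmax}(S)^\top,
\]

so the same result is computed while the GEMM's \(M\)-dimension now runs over the \(n\) keys rather than the \(q\) queries, filling the fixed-size tiles instead of padding them.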
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Shah, Jay, Bikshandi, Ganesh, Zhang, Ying, Thakkar, Vijay, Ramani, Pradeep, Dao, Tri
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0x, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6x lower numerical error than a baseline FP8 attention.
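The warp-specialized TMA pipeline in FlashAttention-3 is Hopper-specific, but the underlying overlap idea in technique (1) can be shown with the portable cp.async intrinsics. The toy kernel below is my sketch, not the paper's code: it double-buffers tiles in shared memory so the copy of tile \(t{+}1\) is in flight while tile \(t\) is being consumed.

    #include <cuda_pipeline.h>

    // Toy compute-copy overlap: launch with <<<1, 256>>>, where 'in' holds
    // num_tiles * 256 float4 elements. The real FlashAttention-3 kernel uses
    // TMA and warp-specialization, which this sketch does not attempt.
    __global__ void overlap_sum(const float4* in, float* out, int num_tiles) {
        __shared__ float4 buf[2][256];   // double buffer: one tile per slot
        float acc = 0.0f;

        // Kick off the asynchronous copy of tile 0.
        __pipeline_memcpy_async(&buf[0][threadIdx.x], &in[threadIdx.x],
                                sizeof(float4));
        __pipeline_commit();

        for (int t = 0; t < num_tiles; ++t) {
            if (t + 1 < num_tiles) {
                // Producer step: start fetching tile t+1 before touching tile t.
                __pipeline_memcpy_async(&buf[(t + 1) & 1][threadIdx.x],
                                        &in[(size_t)(t + 1) * 256 + threadIdx.x],
                                        sizeof(float4));
                __pipeline_commit();
            }
            // Consumer step: wait only until tile t has landed, then compute
            // on it while the copy of tile t+1 is still in flight.
            __pipeline_wait_prior(t + 1 < num_tiles ? 1 : 0);
            float4 v = buf[t & 1][threadIdx.x];
            acc += v.x + v.y + v.z + v.w;
        }
        atomicAdd(out, acc);
    }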
Accelerating Direct Preference Optimization with Prefix Sharing
Wang, Franklin, Hegde, Sumanth
Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computations, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves 1.1-1.5x improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent 1.3-1.6x speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes.
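Concretely, the mask can be written down directly (my reconstruction from the abstract; the index sets are my notation). Pack each example as [shared prefix \(P\), chosen response \(C\), rejected response \(R\)] in one sequence; position \(i\) may then attend to position \(j\) iff the pair is causal and does not cross from the rejected response into the chosen one:

\[
M_{ij} \;=\; \mathbb{1}\!\left[\, j \le i \,\right] \cdot \mathbb{1}\!\left[\, \neg\,(i \in R \;\wedge\; j \in C) \,\right].
\]

Both responses attend to the single shared copy of the prefix and to themselves, while the \(R \times C\) block is zeroed out, which is exactly the cross-response contamination the block-sparse mask prevents. (With the \([P, C, R]\) layout, causality already rules out \(i \in C,\ j \in R\).)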
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Shah, Jay, Bikshandi, Ganesh, Zhang, Ying, Thakkar, Vijay, Ramani, Pradeep, Dao, Tri
For the Transformer architecture [59], the attention mechanism constitutes the primary computational bottleneck, since computing the self-attention scores of queries and keys has quadratic scaling in the sequence length. Scaling attention to longer context will unlock new capabilities (modeling and reasoning over multiple long documents [24, 43, 50] and files in large codebases [30, 48]), new modalities (high-resolution images [11], audio [23], video [25]), and new applications (user interaction with long history [53], agent workflow with long horizon [62]). This has generated significant interest in making attention faster in the long-context regime, including by approximation [14, 27, 56], software optimization [17, 29, 45], or even alternative architectures [22, 42, 55]. In this work, we build on the work of Dao et al. [17] on developing exact-attention algorithms that integrate knowledge of the GPU's execution model and hardware characteristics into their high-level design. In [17], Dao et al. introduced FlashAttention, a novel tiling strategy for parallelizing attention that eliminates intermediate reads/writes to slow global memory by fusing all of the attention operations into a single GPU kernel. Dao [15] restructured the algorithm as FlashAttention-2 to also parallelize over the sequence length dimension and perform the inner loop of the forward pass over blocks of the key and value matrices, thus improving the occupancy and distribution of work on the GPU. However, we observe that FlashAttention-2 nonetheless achieves poor utilization on newer GPUs relative to optimized matrix-multiplication (GEMM) kernels, such as 35% vs. 80-90% on the Hopper H100 GPU. Partially, this may be attributed to implementation-level differences, such as not using Hopper-specific instructions in place of Ampere ones when targeting the Tensor Cores. Several works, such as ThunderKittens [52] and cuDNN 9 [39], have shown that with Hopper-specific instructions and tile-based abstractions one can speed up attention computation and simplify the implementation.
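For context, the single-kernel fusion described here rests on the standard online-softmax recurrence (written in our notation, not quoted from the paper): the kernel keeps a running row-max \(m\), normalizer \(\ell\), and unnormalized output \(O\), and folds in one key/value block \((K^{(k)}, V^{(k)})\) at a time:

\[
S^{(k)} = \frac{Q\,{K^{(k)}}^{\top}}{\sqrt{d}}, \qquad
m^{(k)} = \max\!\big(m^{(k-1)},\ \mathrm{rowmax}(S^{(k)})\big),
\]
\[
\ell^{(k)} = e^{m^{(k-1)} - m^{(k)}}\,\ell^{(k-1)} + \mathrm{rowsum}\big(e^{S^{(k)} - m^{(k)}}\big), \qquad
O^{(k)} = e^{m^{(k-1)} - m^{(k)}}\,O^{(k-1)} + e^{S^{(k)} - m^{(k)}}\,V^{(k)},
\]

with the final output \(O = \mathrm{diag}(\ell^{(T)})^{-1} O^{(T)}\) after the last block \(T\). FlashAttention-2 and FlashAttention-3 differ in how this loop is parallelized and pipelined, not in the recurrence itself.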