Goto

Collaborating Authors

 flashattention





FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Neural Information Processing Systems

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware---accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.


Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth

Nair, Arjun S.

arXiv.org Machine Learning

Large language model fine-tuning is bottlenecked by memory: a 7B parameter model requires 84GB--14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states--exceeding even A100-40GB capacity. We present Chronicals, an open-source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK-RoPE (2.3x) fusion; (2) Cut Cross-Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically-derived 16x differential learning rates between adapter matrices; and (4) Best-Fit Decreasing sequence packing recovering 60-75% of compute wasted on padding. On Qwen2.5-0.5B with A100-40GB, Chronicals achieves 41,184 tokens/second for full fine-tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms--the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^{-1}), LoRA+ learning rate derivations from gradient magnitude analysis, and bin-packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.


Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention

Neural Information Processing Systems

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention---which is the only component scaling quadratically w.r.t. the sequence length---becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementation concerns and end up imposing a simple and static structure over the attention matrix. Conversely, implementing more dynamic sparse attention often results in runtimes significantly slower than computing the full attention using the Flash implementation from Dao et al. (2022). We extend FlashAttention to accommodate a large class of attention sparsity patterns that, in particular, encompass key/query dropping and hashing-based attention. This leads to implementations with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention. Even with relatively low degrees of sparsity, our method improves visibly upon FlashAttention as the sequence length increases. Without sacrificing perplexity, we increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of respectively $8k$ and $16k$ tokens.


SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Lin, Jiawei, Li, Yuanlong, Chen, Guokai, Bourgeat, Thomas

arXiv.org Artificial Intelligence

Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails to fully utilize the compute resources of modern systolic-array-based accelerators designed for consecutive and large matrix multiplications. To fully unleash the performance potential of systolic arrays for FlashAttention, we propose FSA, an enhanced systolic array architecture that runs the entire FlashAttention on the array without external vector units. Combined with SystolicAttention, an optimized kernel for FSA that achieves fine-grained and element-wise overlapping of FlashAttention operations, FSA maximizes array utilization while preserving the original floating-point operation order of FlashAttention. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art systolic-array-based accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS Neuron-v2 and Google TPUv5e, respectively. We synthesize FSA in a 16 nm technology at 1.5 GHz, and results indicate only a 12% area overhead compared to a standard weight-stationary systolic array.


Block Sparse Flash Attention

Ohayon, Daniel, Lamprecht, Itay, Hubara, Itay, Cohen, Israel, Soudry, Daniel, Elata, Noam

arXiv.org Artificial Intelligence

Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention


Differentiable Entropy Regularization: A Complexity-Aware Approach for Neural Optimization

Shihab, Ibne Farabi, Akter, Sanjeda, Sharma, Anuj

arXiv.org Artificial Intelligence

We introduce the first differentiable approximation of range-partition entropy, a complexity measure from computational geometry that directly bounds algorithmic runtime. Unlike architectural modifications, our method is a complementary regularizer that provides orthogonal efficiency gains when combined with existing optimizations. We establish theoretical guarantees in computational geometry, achieving 4-5 provable speedups on convex hull and triangulation with <0.2% error. On ImageNet-1K with ViT -Base, entropy regularization achieves 80.1% top-1 accuracy at 80% sparsity (1.60 standalone speedup), and when combined with FlashAttention yields 2.07 speedup versus 1.63 for FlashAttention alone. On large language models (LLaMA-2 7B, Mistral-7B, Phi-2), we achieve 1.48-1.60 Unlike prior regularization methods that target output distributions, we directly minimize representation complexity, yielding both efficiency gains and improved robustness through semantically structured sparsity patterns (IoU 0.73 vs 0.41 for magnitude pruning, CIFAR-100-C mCE 48.7 vs 55.4). Benefits are strongest for geometry and vision transformers, with more modest but measurable gains on LLMs, demonstrating that complexity regularization offers a principled pathway to joint efficiency-robustness optimization. Modern deep networks achieve impressive performance but face two critical challenges. They are fragile under distribution shift (Hendrycks & Dietterich, 2019) and require prohibitive computational costs (Strubell et al., 2019). The standard approach treats these problems independently, addressing robustness through data augmentation and efficiency through architectural changes. We ask: can a single principle address both? Our strongest results are in computational geometry and vision transformers; for large language models the overhead-benefit tradeoff is more nuanced, with improvements primarily in high-throughput deployment settings rather than research-scale fine-tuning. The key insight comes from computational geometry. Algorithmically simple representations, those with low complexity under geometric partitions, both enable faster algorithms via instance-optimal procedures (Chan, 1996) and generalize better by avoiding spurious features (Geirhos et al., 2020). However, no existing method provides a differentiable measure of algorithmic complexity that can be optimized end-to-end during neural network training. Robustness Label smoothing Y es No No No Adversarial training Y es No No No Info-theoretic Information bottleneck Y es No No No Fixed sparse Longformer, BigBird No No No No Kernel opt. Our goal is to connect representation learning to algorithmic complexity.


Patent Language Model Pretraining with ModernBERT

Yousefiramandi, Amirhossein, Cooney, Ciaran

arXiv.org Artificial Intelligence

Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference over - 3x that of PatentBERT - underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.