FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Nrusimha, Aniruddha, Brandon, William, Mishra, Mayank, Shen, Yikang, Panda, Rameswar, Ragan-Kelley, Jonathan, Kim, Yoon

arXiv.org Artificial Intelligence

The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel for accelerating low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
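
To make the whole-model-kernel idea concrete, here is a minimal CUDA sketch (not the FlashFormer implementation; the kernel name, toy layer body, and shapes are hypothetical) that runs every layer of a small batch-size-1 model inside one cooperative launch, using a grid-wide barrier between layers instead of per-layer kernel launches:

```cuda
// Hypothetical sketch of a "whole-model" persistent kernel: one cooperative
// launch executes all layers of a toy MLP for batch-size-1 decoding, with a
// grid-wide barrier between layers instead of separate kernel launches.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void whole_model_kernel(const float* __restrict__ weights, // [L][D][D]
                                   float* __restrict__ act_a,         // [D]
                                   float* __restrict__ act_b,         // [D]
                                   int num_layers, int dim) {
    cg::grid_group grid = cg::this_grid();
    float* in  = act_a;
    float* out = act_b;
    for (int l = 0; l < num_layers; ++l) {
        const float* W = weights + (size_t)l * dim * dim;
        // Grid-strided matvec: each thread owns a subset of output rows.
        for (int row = blockIdx.x * blockDim.x + threadIdx.x;
             row < dim; row += gridDim.x * blockDim.x) {
            float acc = 0.f;
            for (int col = 0; col < dim; ++col)
                acc += W[(size_t)row * dim + col] * in[col];
            out[row] = acc > 0.f ? acc : 0.f;   // ReLU stand-in for the layer nonlinearity
        }
        grid.sync();                            // all blocks finish layer l before layer l+1
        float* tmp = in; in = out; out = tmp;   // swap activation buffers
    }
    // Final activations end up in act_a or act_b depending on the parity of num_layers.
}

// Host side: grid.sync() requires a cooperative launch with a co-resident grid.
void run(const float* d_weights, float* d_act_a, float* d_act_b,
         int num_layers, int dim) {
    void* args[] = { &d_weights, &d_act_a, &d_act_b, &num_layers, &dim };
    dim3 block(256), grid(64);  // grid size must respect co-residency limits
    cudaLaunchCooperativeKernel((void*)whole_model_kernel, grid, block, args, 0, 0);
    cudaDeviceSynchronize();
}
```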


HipKittens: Fast and Furious AMD Kernels

Hu, William, Wadsworth, Drew, Siddens, Sean, Winata, Stanley, Fu, Daniel Y., Swann, Ryann, Osama, Muhammad, Ré, Christopher, Arora, Simran

arXiv.org Artificial Intelligence

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages such as ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, the algorithms that instantiate these abstractions need to be rethought for AMD. We validate the HK primitives across the CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.


FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Yan, Ran, Jiang, Youhe, Chen, Zhuoming, Mai, Haohui, Chen, Beidi, Yuan, Binhang

arXiv.org Artificial Intelligence

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads per GQA group -- this inconsistency significantly limits the applicability of the sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with varied, smaller numbers of query heads per GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. GitHub repo at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.


ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive

Luo, Xinhao, Liu, Zihan, Zhou, Yangjie, Fang, Shihan, Huang, Ziyu, Feng, Yu, Zhang, Chen, Sun, Shixuan, Zheng, Zhenzhe, Leng, Jingwen, Guo, Minyi

arXiv.org Artificial Intelligence

Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the scope of operator fusion by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
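
A minimal sketch of a ClusterReduce-style pattern using the public CUDA cooperative-groups cluster API (Hopper, sm_90) is shown below. The kernel and its shapes are hypothetical and this is not the ClusterFusion source; it only illustrates how per-block partial results can be combined through distributed shared memory without a round trip to off-chip memory:

```cuda
// Hedged sketch: each thread block reduces its slice into its own shared memory,
// then block rank 0 of the cluster pulls the peers' partials through distributed
// shared memory, so the reduction itself never touches global memory.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(4, 1, 1)
cluster_reduce_sketch(const float* __restrict__ x, float* __restrict__ out, int n) {
    __shared__ float partial;                 // this block's partial sum
    cg::cluster_group cluster = cg::this_cluster();
    cg::thread_block  block   = cg::this_thread_block();

    float local = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += x[i];

    if (threadIdx.x == 0) partial = 0.f;
    block.sync();
    atomicAdd(&partial, local);               // simple block-level reduction for the sketch
    block.sync();

    cluster.sync();                           // make every block's `partial` visible

    // Block rank 0 gathers the peers' partials via distributed shared memory.
    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        float sum = 0.f;
        for (unsigned r = 0; r < cluster.num_blocks(); ++r)
            sum += *cluster.map_shared_rank(&partial, r);
        atomicAdd(out, sum);                  // one global write per cluster, not per block
    }
    cluster.sync();                           // keep peer shared memory alive until reads finish
}
// Requires sm_90; launch with gridDim.x a multiple of the cluster size (4 here)
// and *out initialized to zero on the host.
```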


Fused3S: Fast Sparse Attention on Tensor Cores

Li, Zitong, Chandramowlishwaran, Aparna

arXiv.org Artificial Intelligence

Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse matrix multiplication (SpMM). Efficiently executing the 3S computational pattern on modern GPUs remains challenging due to (a) the mismatch between unstructured sparsity and tensor cores optimized for dense operations, and (b) the high cost of data movement. Previous works have optimized these sparse operations individually or addressed one of these challenges. This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves $1.6-16.3\times$ and $1.5-14\times$ speedup over state-of-the-art on H100 and A30 GPUs. Furthermore, integrating Fused3S into Graph Transformer inference accelerates end-to-end performance by $1.05-5.36\times$, consistently outperforming all 3S baselines across diverse datasets (single and batched graphs) and GPU architectures.
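
A scalar reference sketch of the fused 3S pattern (one hypothetical CUDA thread per row of a CSR mask, no tensor cores) shows how SDDMM, softmax, and SpMM can be chained without materializing the score matrix in global memory; it is illustrative only and not the Fused3S kernel:

```cuda
// For each row i of a CSR sparsity mask, scores are computed only at the stored
// nonzeros (SDDMM), normalized (softmax), and immediately used to accumulate the
// output row (SpMM). Scores are recomputed per pass to keep the sketch simple.
__global__ void fused_3s_sketch(const int* __restrict__ rowptr,   // [rows+1]
                                const int* __restrict__ colidx,   // [nnz]
                                const float* __restrict__ Q,      // [rows, d]
                                const float* __restrict__ K,      // [cols, d]
                                const float* __restrict__ V,      // [cols, d]
                                float* __restrict__ O,            // [rows, d]
                                int rows, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    int start = rowptr[i], end = rowptr[i + 1];

    // SDDMM + softmax statistics over this row's nonzeros.
    float m = -1e30f;
    for (int e = start; e < end; ++e) {
        int j = colidx[e];
        float s = 0.f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        m = fmaxf(m, s);
    }
    float denom = 0.f;
    for (int e = start; e < end; ++e) {
        int j = colidx[e];
        float s = 0.f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        denom += expf(s - m);
    }
    // SpMM: accumulate the softmax-weighted values directly into the output row.
    for (int k = 0; k < d; ++k) O[i * d + k] = 0.f;
    for (int e = start; e < end; ++e) {
        int j = colidx[e];
        float s = 0.f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        float p = expf(s - m) / denom;
        for (int k = 0; k < d; ++k) O[i * d + k] += p * V[j * d + k];
    }
}
// Launch: one thread per row, e.g. fused_3s_sketch<<<(rows + 255) / 256, 256>>>(...).
```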


Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Zhang, Shulai, Zheng, Ningxin, Lin, Haibin, Jiang, Ziheng, Bao, Wenlei, Jiang, Chengquan, Hou, Qi, Cui, Weihao, Zheng, Size, Chang, Li-Wen, Chen, Quan, Liu, Xin

arXiv.org Artificial Intelligence

Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. Developing large MoE models in distributed settings, however, incurs large communication overhead: the inter-device communication of an MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Therefore, existing methods propose pipelining the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes noticeably impair computational efficiency, and their latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$, and for end-to-end execution it delivers a $1.71\times$ speedup on average. COMET has been adopted in production clusters with tens of thousands of GPUs, achieving savings of millions of GPU hours.


Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Li, Yinghan, Li, Yifei, Zhang, Jiejing, Chen, Bujiao, Chen, Xiaotong, Duan, Lian, Jin, Yejun, Li, Zheng, Liu, Xuanyu, Wang, Haoyu, Wang, Wente, Wang, Yajie, Yang, Jiacheng, Zhang, Peiyang, Zheng, Laiwen, Yu, Wenyuan

arXiv.org Artificial Intelligence

Resource utilization is one of the key factors in fully exploiting the computing power of massively parallel devices, including GPUs. Batching is a common method to improve utilization and reduce overhead, and its benefit should never be underestimated [7, 8, 11]. In most cases, it is straightforward to batch regular workloads that share the same type and size and therefore have similar amounts of computation and memory access. For example, in the CUDA programming model, such regular workloads can be conveniently batched along an additional thread block or grid dimension [15]. However, irregular workloads do not naturally fit into this scheme. Irregular workloads may show one or more of the following characteristics that prevent regular batching [1]: variable amounts of computation, special memory access patterns, control flow divergence, etc. Moreover, heterogeneous workloads raise the difficulty of batching to an almost prohibitive level. Here, by heterogeneous, we refer to workloads consisting of different types of operations, e.g., some workloads are reductions while others are element-wise operations. Irregular workloads are often managed in a task-parallel fashion instead of batching, where an individual workload is regarded as a task and all tasks are scheduled dynamically [1, 19].
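
As a concrete example of the regular-batching scheme referenced above, the hypothetical kernel below batches N same-sized element-wise problems along an extra grid dimension; the names and parameters are illustrative, not taken from the paper:

```cuda
// N same-shaped element-wise workloads launched as one kernel, with the extra
// grid dimension (blockIdx.y) selecting which problem in the batch a block
// works on. Irregular or heterogeneous workloads break this pattern because
// each problem would need a different size, access pattern, or operation.
__global__ void batched_axpy(const float* const* __restrict__ xs,  // xs[b]: input of problem b
                             float* const* __restrict__ ys,        // ys[b]: output of problem b
                             const float* __restrict__ alphas,     // one scalar per problem
                             int n) {                               // shared length (regular!)
    int b = blockIdx.y;                                 // which problem in the batch
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // element within that problem
    if (i < n)
        ys[b][i] = alphas[b] * xs[b][i] + ys[b][i];
}
// Launch: dim3 grid((n + 255) / 256, batch_size); batched_axpy<<<grid, 256>>>(xs, ys, alphas, n);
```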


Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation

Park, Yeonhong, Hyun, Jake, Kim, Hojoon, Lee, Jae W.

arXiv.org Artificial Intelligence

Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose QDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and inference latency reduction. QDEC stores the residual matrix -- the difference between the full-precision and quantized weights -- in CPU memory, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, and the fetched residuals help correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations; this allows adaptation to the dynamic nature of the activation distribution and thus maximizes the effectiveness of error compensation. We demonstrate the effectiveness of QDEC by augmenting state-of-the-art quantization methods. For example, QDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12 -- outperforming its 3.5-bit counterpart -- while adding less than 0.0003\% to GPU memory usage and incurring only a 1.7\% inference slowdown on an NVIDIA RTX 4050 Mobile GPU. The code will be publicly available soon.
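
One way to picture the compensation step (a hedged sketch under assumptions, not the QDEC code): assuming the residual matrix is kept in host-mapped CPU memory and a list of salient channel indices has already been selected from activation outliers, a hypothetical kernel adds the residual contribution of only those channels to the quantized layer's output:

```cuda
// Corrects a low-bit matvec on a few "salient" input channels only. The residual
// R = W_fp - W_quant is assumed to live in pinned, host-mapped memory and is read
// over PCIe only for the selected channels, so GPU memory stays at the quantized
// footprint. Names, shapes, and the selection mechanism are illustrative.
__global__ void residual_correction(float* __restrict__ y,              // [rows], quantized matvec output
                                    const float* __restrict__ residual, // [rows, cols], host-mapped
                                    const float* __restrict__ x,        // [cols], current activations
                                    const int* __restrict__ salient,    // [n_salient] channel indices
                                    int n_salient, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float corr = 0.f;
    for (int s = 0; s < n_salient; ++s) {
        int c = salient[s];                               // channel flagged by an activation outlier
        corr += residual[(size_t)row * cols + c] * x[c];  // only these residual entries are fetched
    }
    y[row] += corr;  // compensate the quantization error contributed by the salient channels
}
```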


FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

Mu, Pengyu, Wei, Linquan, Liu, Yi, Wang, Rui

arXiv.org Artificial Intelligence

Many artificial intelligence models process input data of different lengths and resolutions, making the shapes of their tensors dynamic. The performance of these models depends on the tensor shapes, which makes it difficult to optimize the corresponding tensor programs before the model runs. There are two common solutions to this problem. The first is to pad the input with useless data so that it matches a pre-optimized tensor library. The second is to use small basic tensors to compose a tensor that is closest in size to the input data and then tune it to minimize padding; however, this second solution can be time-consuming. This paper proposes a new technique for deep learning compilers called FTuner. Instead of searching a large design space or training a cost model, we use an abstract computational unit called the uKernel to patch together small, various-sized tensors to match the shape of the input tensor, and we determine the shape of the uKernel using an analytic hardware information model. Experiments show that FTuner achieves operator and end-to-end performance comparable to vendor libraries, and a 3\% speedup over an existing auto-tuner that relies on a trained cost model, while reducing tuning time by two orders of magnitude.
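
The padding trade-off can be made concrete with a small host-side sketch (hypothetical tile sizes and a greedy cover, not FTuner's analytic model): covering a dynamic dimension with a mix of small uKernel-like tiles wastes far less computation than padding to a single large pre-optimized size.

```cuda
// Host-side toy: cover a dynamic dimension `n` with fixed-size tiles and report
// how much padding is left over; the gap to `n` is wasted computation.
#include <vector>
#include <algorithm>
#include <cstdio>

int padded_length(int n, std::vector<int> tiles) {
    std::sort(tiles.rbegin(), tiles.rend());           // largest tile first
    int covered = 0;
    while (covered < n) {
        int remaining = n - covered;
        int chosen = tiles.back();                      // smallest tile as fallback (may overshoot)
        for (int t : tiles) if (t <= remaining) { chosen = t; break; }
        covered += chosen;
    }
    return covered;
}

int main() {
    int n = 1237;  // hypothetical dynamic sequence length
    int single = padded_length(n, {512});
    int mixed  = padded_length(n, {512, 128, 32, 8});
    printf("single 512-tile library: %d (pad %d)\n", single, single - n);
    printf("mixed uKernel sizes:     %d (pad %d)\n", mixed, mixed - n);
}
```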


Optimized Speculative Sampling for GPU Hardware Accelerators

Wagner, Dominik, Lee, Seanie, Baumann, Ilja, Seeberger, Philipp, Riedhammer, Korbinian, Bocklet, Tobias

arXiv.org Artificial Intelligence

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
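
One plausible reading of the sigmoid approximation (a hedged sketch, not the authors' kernel) is that the drafted token's acceptance ratio is scored from its logit alone via a sigmoid, which avoids the vocabulary-wide reduction the exact softmax path requires; the hypothetical kernel below contrasts both variants, with each block handling one drafted position and partial sums kept in shared memory.

```cuda
// Acceptance ratio min(1, p_target(x) / p_draft(x)) for one drafted token per block.
// Exact path: softmax normalizers over the vocabulary (max-subtraction for
// numerical stability omitted for brevity). Approximate path: per-token sigmoid,
// no vocabulary-wide reduction. All names and shapes are illustrative.
__global__ void acceptance_ratio(const float* __restrict__ target_logits, // [positions, vocab]
                                 const float* __restrict__ draft_logits,  // [positions, vocab]
                                 const int* __restrict__ draft_tokens,    // [positions]
                                 float* __restrict__ ratio,               // [positions]
                                 int vocab, bool use_sigmoid) {
    __shared__ float sums[2];                 // [0]: target normalizer, [1]: draft normalizer
    int pos = blockIdx.x;
    int tok = draft_tokens[pos];
    const float* tl = target_logits + (size_t)pos * vocab;
    const float* dl = draft_logits  + (size_t)pos * vocab;

    if (use_sigmoid) {
        // sigmoid(z) = 1 / (1 + exp(-z)): only the drafted token's logits are touched.
        if (threadIdx.x == 0) {
            float pt = 1.f / (1.f + expf(-tl[tok]));
            float pd = 1.f / (1.f + expf(-dl[tok]));
            ratio[pos] = fminf(1.f, pt / pd);
        }
        return;
    }
    // Exact path: softmax normalizers computed cooperatively in on-chip shared memory.
    if (threadIdx.x < 2) sums[threadIdx.x] = 0.f;
    __syncthreads();
    float t_sum = 0.f, d_sum = 0.f;
    for (int v = threadIdx.x; v < vocab; v += blockDim.x) {
        t_sum += expf(tl[v]);
        d_sum += expf(dl[v]);
    }
    atomicAdd(&sums[0], t_sum);
    atomicAdd(&sums[1], d_sum);
    __syncthreads();
    if (threadIdx.x == 0) {
        float pt = expf(tl[tok]) / sums[0];
        float pd = expf(dl[tok]) / sums[1];
        ratio[pos] = fminf(1.f, pt / pd);
    }
}
// Launch: one block per drafted position, e.g. acceptance_ratio<<<positions, 256>>>(...).
```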