FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim

arXiv.org Artificial Intelligence 

The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive serving. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel to accelerate low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups over existing inference kernels.
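To make the motivation concrete, the sketch below is a toy back-of-envelope model of why fusing a decode step into one kernel helps at low batch sizes: if each kernel launch pays a fixed overhead, an unfused forward pass that launches many small kernels per layer accumulates far more overhead than a single fused kernel. All numbers here (per-launch cost, kernels per layer, layer count) are illustrative assumptions, not figures from the paper.

```python
# Toy model of kernel-launch overhead in low-batch decoding.
# All constants are illustrative assumptions, NOT measurements from FlashFormer.

LAUNCH_OVERHEAD_US = 5.0   # assumed fixed cost per kernel launch (microseconds)
KERNELS_PER_LAYER = 10     # assumed kernel count per transformer layer (attn + MLP + norms)
NUM_LAYERS = 32            # e.g. a 7B-class model

def decode_launch_overhead_us(num_kernels: int) -> float:
    """Total launch overhead for one decode step that issues num_kernels kernels."""
    return num_kernels * LAUNCH_OVERHEAD_US

# Unfused: every layer launches its own set of small kernels.
unfused_us = decode_launch_overhead_us(KERNELS_PER_LAYER * NUM_LAYERS)

# Fused: the whole forward pass runs as a single kernel.
fused_us = decode_launch_overhead_us(1)

print(f"unfused launch overhead per token: {unfused_us:.0f} us")
print(f"fused launch overhead per token:   {fused_us:.0f} us")
```

Under these assumed numbers, the unfused decode step spends 1600 us per token on launch overhead alone versus 5 us when fused; at low batch sizes, where each kernel does little work, this fixed cost is a meaningful fraction of per-token latency.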