Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Haziza, Daniel, Chou, Timothy, Choudhary, Dhruv, Wehrstedt, Luca, Massa, Francisco, Yu, Jiecao, Jeong, Geonhwa, Rao, Supriya, Labatut, Patrick, Cai, Jesse
arXiv.org Artificial Intelligence
In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference. The rapid growth of Large Language Models (LLMs) in recent years has been driven by a corresponding surge in GPU FLOPs.
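As a minimal illustrative sketch of the idea described in the abstract (not the authors' kernel implementation), the snippet below shows a Squared-ReLU activation, which produces many exact zeros, and a dense emulation of the 2:4 pattern that keeps the two largest-magnitude values in every contiguous group of four. The function names are hypothetical, and real speedups require hardware-accelerated 2:4 sparse kernels rather than this dense masking.

```python
import torch


def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU: relu(x) ** 2. Negative inputs map to exact zeros,
    # so the resulting activations are naturally sparse.
    return torch.relu(x) ** 2


def apply_2_4_sparsity(x: torch.Tensor) -> torch.Tensor:
    # Enforce the 2:4 pattern along the last dimension: in every
    # contiguous group of 4 values, keep the 2 largest magnitudes
    # and zero out the other 2. (Dense emulation for illustration only.)
    orig_shape = x.shape
    groups = x.reshape(-1, 4)
    top2 = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    return (groups * mask).reshape(orig_shape)


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 8)  # last dimension must be a multiple of 4
    act = squared_relu(x)
    sparse_act = apply_2_4_sparsity(act)
    print("fraction of zeros after Squared-ReLU:", (act == 0).float().mean().item())
    print("fraction of zeros after 2:4 masking: ", (sparse_act == 0).float().mean().item())
```

Because Squared-ReLU activations already contain many zeros, enforcing the 2:4 constraint discards little information, which is consistent with the paper's claim of no accuracy loss.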
Mar-20-2025