VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini
While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores [1] of a compute cluster through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7× lower latency and 74.3× lower energy compared to the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in the GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, achieving up to 5.8× and 3.6× reductions in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.

Transformer-based models, such as the GPT family [2] and the LLaMa family [3], have emerged as a cornerstone of machine learning, demonstrating state-of-the-art performance in diverse domains, including natural language processing (NLP), computer vision, and audio processing. At the core of their success is the Transformer architecture [4], which utilizes the self-attention mechanism to model complex relationships within input sequences. In encoders and in the prefill stage of decoders, the computational complexity of attention layers scales quadratically with the input sequence length, leading to memory and computational overheads that necessitate mitigation by means of dedicated acceleration.

This work was supported by the NeuroSoC project, funded under the European Union's Horizon Europe research and innovation programme (Grant Agreement No. 101070634).

[Figure: For each sequence length, the left bar shows unoptimized GEMM results, while the right bar reflects optimized GEMM results.]
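The exponentiation block described above builds on Schraudolph's method, which approximates exp(x) by assembling the result's IEEE-754 bit pattern with integer arithmetic: since exp(x) = 2^(x / ln 2), scaling x by 2^m / ln 2 (with m the mantissa width) and adding the exponent bias places the integer part of x / ln 2 directly into the exponent field, while the fractional part fills the mantissa as a piecewise-linear approximation of 2^f. The abstract does not give the paper's exact Bfloat16 variant or constants, so the sketch below is a minimal illustration using the standard float32 parameters (mantissa width 23, bias 127) and a commonly used correction term; the softmax helper is likewise illustrative, not the paper's kernel.

#include <stdint.h>
#include <string.h>

/* Schraudolph-style approximation of exp(x) for float32.
 * exp(x) = 2^(x / ln 2); multiplying x by 2^23 / ln 2 places the
 * integer part of x / ln 2 into the exponent bits (after adding the
 * bias 127 * 2^23) and the fractional part into the mantissa, which
 * acts as a piecewise-linear approximation of 2^f. A Bfloat16 unit
 * would use a 7-bit mantissa shift instead (assumption, not the
 * paper's stated design). */
static inline float fast_exp(float x) {
    const float scale = 12102203.0f;          /* 2^23 / ln(2) */
    /* 127 * 2^23 = 1065353216; subtracting 486411 is a common
     * correction that reduces the relative error of the mantissa
     * interpolation (a tuning choice, not taken from the paper). */
    const int32_t bias = 1064866805;
    int32_t i = (int32_t)(scale * x) + bias;  /* assemble bit pattern */
    float r;
    memcpy(&r, &i, sizeof r);                 /* reinterpret as float */
    return r;                                 /* valid for |x| < ~87 */
}

/* Illustrative numerically stable softmax built on fast_exp:
 * subtracting the row maximum keeps every argument non-positive. */
void softmax(const float *x, float *y, int n) {
    float m = x[0];
    for (int i = 1; i < n; i++)
        if (x[i] > m) m = x[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        y[i] = fast_exp(x[i] - m);
        sum += y[i];
    }
    for (int i = 0; i < n; i++)
        y[i] /= sum;
}

In the paper's approach, this bit-manipulation step is moved into the FPU datapath behind a custom instruction, so a software kernel along these lines collapses each exponentiation to a single hardware operation.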
arXiv.org Artificial Intelligence
Apr-16-2025