A Flexible Instruction Set Architecture for Efficient GEMMs
Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas
arXiv.org Artificial Intelligence
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver higher throughput on GEMMs than their SIMD/vector counterparts, they are rigid solutions, unable to adapt dynamically to application-specific aspects such as the data format. This paper demonstrates that state-of-the-art matrix ISAs deliver sub-optimal performance when running the most commonly used convolution and transformer models. It proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and interacts seamlessly with existing vector ISAs. MTE incurs minimal implementation overhead since it requires only a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves a 1.35x speed-up over the best state-of-the-art matrix ISA.

GEneral Matrix Multiplications (GEMMs) are ubiquitous in high-performance computing and deep learning workloads [1]. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs) [2], [3], [4] to leverage their high-throughput floating-point functional units.
In a push for even higher throughput, major hardware vendors are now incorporating matrix ISAs into CPU architectures, with the first implementations released in recent years [5], [6], [7], [8]. These approaches achieve higher compute throughput than SIMD/vector ISAs by exploiting specialized Matrix-Multiply-Accumulate (MMA) units.
Jul-8-2025