The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Martínez, Héctor, Castelló, Adrián, Igual, Francisco D., Quintana-Ortí, Enrique S.

arXiv.org Artificial Intelligence 

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization-driven by the demands of DL inference on heterogeneous architectures, marking what we term as the "Cambrian period" for matrix multiplication.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found