DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

Ganji, Darshan C., Ashfaq, Saad, Saboori, Ehsan, Sah, Sudhakar, Mitra, Saptarshi, AskariHemmat, MohammadHossein, Hoffman, Alexander, Hassanien, Ahmed, Léonardon, Mathieu

arXiv.org Artificial Intelligence 

Quantization methods such as Learned Step Size Quantization can achieve model accuracy that is comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.

Achieving low latency inference with ultra low-bit models on general purpose processors (GPPs) remains an active area of research [8, 11, 19]. Deep learning workloads on CPUs are typically accelerated by exploiting data-level parallelism through SIMD programming. However, ultra low-bit deep learning operators cannot be efficiently executed on these devices because sub-8-bit instructions are not generally supported in the vectorized instruction sets of mainstream CPU architectures, including SSE/AVX instructions on x86 and Neon instructions on Arm.

[Table fragment interleaved in the extraction: ResNet34 74.1% / 74.1% / 72.4%; ResNet50 76.9% / 76.8% / 74.6%; VGG16 73.4% / 73.5% / 71.4% (column headers not recoverable).]
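To make the lookup-table idea from the abstract concrete, below is a minimal scalar C++ sketch, not the paper's kernel: the function names (build_product_lut, lut_dot), the 2-bit quantization levels, and the code-packing scheme are illustrative assumptions, and the actual DeepGEMM implementation vectorizes the lookups on SIMD hardware.

```cpp
// Minimal sketch of lookup-table based low-bit dot products.
// Assumption-driven illustration: names, levels, and packing are hypothetical.
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// With 2-bit weights and 2-bit activations there are only 4 x 4 = 16
// possible products, so all of them can be precomputed once.
std::array<int8_t, 16> build_product_lut(const std::array<int8_t, 4>& w_levels,
                                         const std::array<int8_t, 4>& a_levels) {
    std::array<int8_t, 16> lut{};
    for (int w = 0; w < 4; ++w)
        for (int a = 0; a < 4; ++a)
            lut[(w << 2) | a] = static_cast<int8_t>(w_levels[w] * a_levels[a]);
    return lut;
}

// Dot product over 2-bit codes: each multiply-accumulate becomes a
// table lookup followed by an add.
int32_t lut_dot(const std::vector<uint8_t>& w_codes,   // values in [0, 3]
                const std::vector<uint8_t>& a_codes,   // values in [0, 3]
                const std::array<int8_t, 16>& lut) {
    int32_t acc = 0;
    for (size_t i = 0; i < w_codes.size(); ++i)
        acc += lut[(w_codes[i] << 2) | a_codes[i]];    // lookup instead of multiply
    return acc;
}

int main() {
    // Hypothetical symmetric 2-bit weight levels and unsigned activation levels
    // (quantization scales assumed to be folded out for clarity).
    std::array<int8_t, 4> w_levels = {-2, -1, 1, 2};
    std::array<int8_t, 4> a_levels = {0, 1, 2, 3};
    auto lut = build_product_lut(w_levels, a_levels);

    std::vector<uint8_t> w = {0, 3, 1, 2};
    std::vector<uint8_t> a = {3, 3, 0, 1};
    std::printf("dot = %d\n", lut_dot(w, a, lut));
    return 0;
}
```

Because a 2-bit-by-2-bit product table has only 16 entries, it is small enough to sit in a vector register, which is what makes table lookups attractive on SIMD units that lack sub-8-bit arithmetic instructions.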
