AskariHemmat, MohammadHossein
QGen: On the Ability to Generalize in Quantization Aware Training
AskariHemmat, MohammadHossein, Jeddi, Ahmadreza, Hemmat, Reyhane Askari, Lazarevich, Ivan, Hoffman, Alexander, Sah, Sudhakar, Saboori, Ehsan, Savaria, Yvon, David, Jean-Pierre
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications for model performance. In particular, we first develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on the CIFAR-10, CIFAR-100, and ImageNet datasets, covering both convolutional and transformer-based architectures.
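As a concrete illustration of the quantization noise discussed in this abstract, the following minimal NumPy sketch (a toy example under assumed settings, not the paper's formulation) applies symmetric uniform fake quantization to a weight tensor and measures the resulting error, which is bounded by half the quantization step and shrinks as the bit-width grows.

```python
# Minimal NumPy sketch (not the paper's exact formulation): uniform fake
# quantization of weights and the error it introduces. Inside the clipping
# range the error is bounded by half the step size; this bounded perturbation
# is the "quantization noise" that the generalization bound is conditioned on.
import numpy as np

def fake_quantize(w, num_bits=4):
    """Symmetric uniform quantization: quantize then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax        # step size
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=10_000)      # toy weight tensor

for bits in (8, 4, 2):
    w_q, scale = fake_quantize(w, bits)
    noise = w_q - w                          # quantization "noise"
    print(f"{bits}-bit: step={scale:.4f}, max|noise|={np.abs(noise).max():.4f}")
```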
DeepliteRT: Computer Vision at the Edge
Ashfaq, Saad, Hoffman, Alexander, Mitra, Saptarshi, Sah, Sudhakar, AskariHemmat, MohammadHossein, Saboori, Ehsan
The proliferation of edge devices has unlocked unprecedented opportunities for deep learning model deployment in computer vision applications. However, these complex models require considerable power, memory and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by scaling down the model weights and activations from 32-bit to less than 8-bit. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. Our operator is implemented within Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. Compiler passes in DeepliteRT automatically convert a fake-quantized model in full precision to a compact ultra low-bit representation, easing the process of quantized model deployment on commodity hardware. We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, achieving significant speedups of up to 2.20x, 2.33x and 2.17x, respectively.
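As a rough illustration of the "compact ultra low-bit representation" mentioned in this abstract, the Python sketch below packs fake-quantized 2-bit weights (stored as full-precision floats) into bytes, four values per byte. The code levels and packing layout here are simplified assumptions for illustration, not DeepliteRT's actual compiler pass or memory format.

```python
# Simplified illustration (assumed layout, not DeepliteRT's): convert
# fake-quantized 2-bit weights held as floats into a packed byte
# representation with four 2-bit codes per byte, and back.
import numpy as np

def pack_2bit(w_fakequant, scale):
    """Map fake-quantized float weights to 2-bit codes and pack 4 per byte."""
    codes = np.round(w_fakequant / scale).astype(np.int8) + 2    # {-2..1} -> {0..3}
    codes = np.clip(codes, 0, 3).astype(np.uint8)
    assert codes.size % 4 == 0, "pad to a multiple of 4 in practice"
    codes = codes.reshape(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6)).astype(np.uint8)
    return packed                                                # 4x smaller than int8

def unpack_2bit(packed, scale):
    """Recover the 2-bit codes and dequantize back to floats."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return (codes.astype(np.int8) - 2).ravel() * scale

scale = 0.05
w = np.array([-0.10, -0.05, 0.0, 0.05, 0.05, -0.10, 0.0, -0.05], dtype=np.float32)
packed = pack_2bit(w, scale)
print(packed.nbytes, "bytes for", w.size, "weights")             # 2 bytes for 8 weights
print(np.allclose(unpack_2bit(packed, scale), w))                # True
```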
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables
Ganji, Darshan C., Ashfaq, Saad, Saboori, Ehsan, Sah, Sudhakar, Mitra, Saptarshi, AskariHemmat, MohammadHossein, Hoffman, Alexander, Hassanien, Ahmed, Léonardon, Mathieu
Quantization methods such as Learned Step Size Quantization can achieve model accuracy that is comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.
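To make the lookup table idea concrete, below is a toy Python/NumPy sketch (scalar code, not the paper's vectorized SIMD kernels): with 2-bit weights and 2-bit activations there are only 4 x 4 = 16 possible products, so they can be precomputed once and the inner product reduces to table lookups plus accumulation, with no multiplications. The specific code levels chosen here are illustrative assumptions.

```python
# Toy sketch of the lookup-table idea (not the DeepGEMM kernels themselves):
# precompute every weight-level x activation-level product once, then compute
# dot products by indexing into the table instead of multiplying.
import numpy as np

W_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)   # assumed 2-bit signed weights
A_LEVELS = np.array([0, 1, 2, 3], dtype=np.int32)     # assumed 2-bit unsigned activations

# Precompute all 16 possible products: shape (4, 4).
LUT = W_LEVELS[:, None] * A_LEVELS[None, :]

def lut_dot(w_codes, a_codes):
    """Dot product of 2-bit codes using only table lookups and adds."""
    return int(LUT[w_codes, a_codes].sum())

rng = np.random.default_rng(0)
w_codes = rng.integers(0, 4, size=64)                 # indices into W_LEVELS
a_codes = rng.integers(0, 4, size=64)                 # indices into A_LEVELS

ref = int((W_LEVELS[w_codes] * A_LEVELS[a_codes]).sum())
assert lut_dot(w_codes, a_codes) == ref               # matches the multiply-accumulate result
print(lut_dot(w_codes, a_codes))
```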
Deeplite Neutrino: An End-to-End Framework for Constrained Deep Learning Model Optimization
Sankaran, Anush, Mastropietro, Olivier, Saboori, Ehsan, Idris, Yasser, Sawyer, Davis, AskariHemmat, MohammadHossein, Hacene, Ghouthi Boukli
Designing deep learning-based solutions is becoming a race for training deeper models with a greater number of layers. While a larger, deeper model could provide competitive accuracy, it creates a lot of logistical challenges and unreasonable resource requirements during development and deployment. This has been one of the key reasons why deep learning models are not extensively used in various production environments, especially on edge devices. There is an immediate requirement for optimizing and compressing these deep learning models to enable on-device intelligence. In this research, we introduce a black-box framework, Deeplite Neutrino, for production-ready optimization of deep learning models. The framework provides an easy mechanism for end-users to specify constraints, such as a tolerable drop in accuracy or a target size for the optimized model, to guide the whole optimization process. The framework is easy to include in an existing production pipeline and is available as a Python package supporting the PyTorch and TensorFlow libraries. The optimization performance of the framework is shown across multiple benchmark datasets and popular deep learning models. Further, the framework is currently used in production, and results and testimonials from several clients are summarized.
U-Net Fixed-Point Quantization for Medical Image Segmentation
AskariHemmat, MohammadHossein, Honari, Sina, Rouhier, Lucas, Perone, Christian S., Cohen-Adad, Julien, Savaria, Yvon, David, Jean-Pierre
Model quantization is leveraged to reduce the memory consumption and the computation time of deep neural networks. This is achieved by representing weights and activations with a lower bit resolution when compared to their high precision floating point counterparts. The suitable level of quantization is directly related to the model performance. Lowering the quantization precision (e.g., to 2 bits) reduces the amount of memory required to store model parameters and the amount of logic required to implement computational blocks, which contributes to reducing the power consumption of the entire system. These benefits typically come at the cost of reduced accuracy. The main challenge is to quantize a network as much as possible while maintaining the performance accuracy. In this work, we present a quantization method for the U-Net architecture, a popular model in medical image segmentation. We then apply our quantization algorithm to three datasets: (1) the Spinal Cord Gray Matter Segmentation (GM) dataset, (2) the ISBI challenge for segmentation of neuronal structures in Electron Microscopic (EM) images, and (3) the public National Institutes of Health (NIH) dataset for pancreas segmentation in abdominal CT scans. The reported results demonstrate that with only 4 bits for weights and 6 bits for activations, we obtain an 8-fold reduction in memory requirements while losing only 2.21%, 0.57% and 2.09% Dice overlap score for the EM, GM and NIH datasets, respectively. Our fixed-point quantization provides a flexible trade-off between accuracy and memory requirements which is not provided by previous quantization methods for U-Net such as TernaryNet.
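For illustration, the following NumPy sketch shows a generic fixed-point (Q-format) quantizer with separate bit-widths for weights and activations, in the spirit of the 4-bit/6-bit configuration reported above; the specific Q-formats and value ranges chosen here are assumptions, not the paper's exact quantizer or training procedure.

```python
# Minimal sketch of fixed-point quantization with separate bit-widths for
# weights and activations (illustrative Q-formats, not the paper's quantizer).
import numpy as np

def fixed_point_quantize(x, total_bits, frac_bits, signed=True):
    """Quantize to a fixed-point grid with 2**-frac_bits resolution."""
    step = 2.0 ** -frac_bits
    if signed:
        qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** total_bits - 1
    return np.clip(np.round(x / step), qmin, qmax) * step

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.2, size=1000)          # signed 4-bit, assumed Q1.3 format
activations = rng.uniform(0.0, 4.0, size=1000)     # unsigned (post-ReLU) 6-bit, assumed Q2.4

w_q = fixed_point_quantize(weights, total_bits=4, frac_bits=3, signed=True)
a_q = fixed_point_quantize(activations, total_bits=6, frac_bits=4, signed=False)
print("max weight error:", np.abs(w_q - weights).max())
print("max activation error:", np.abs(a_q - activations).max())
```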