Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge
Rutishauser, Georg, Conti, Francesco, Benini, Luca
arXiv.org Artificial Intelligence
Mixed-precision quantization, where a deep neural network's layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization, which together find mixed-precision configurations that are latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. On the 1000-class ImageNet dataset, we achieve a reduction in end-to-end latency of up to 28.6% compared to an 8-bit model, at a negligible accuracy drop from a full-precision baseline. We also demonstrate speedups relative to an 8-bit baseline, at a negligible accuracy drop, even on systems with no hardware support for sub-byte arithmetic. Furthermore, we show the superiority of our approach over a differentiable search that targets reduced binary operation counts as a proxy for latency.
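The second, hardware-aware stage can be illustrated with a minimal sketch. The abstract does not spell out the heuristic, so the rule below is an assumption made for illustration only: starting from the per-layer bit-widths returned by a search, each layer is raised to the highest precision that is no slower on the target than the precision it was assigned. The function names `free_upgrade` and `total_latency`, the table `EXAMPLE_LATENCY`, and all cycle counts are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Hypothetical per-layer latency profile of one hardware target:
# EXAMPLE_LATENCY[layer][bits] -> cycles. All values are invented.
EXAMPLE_LATENCY: List[Dict[int, float]] = [
    {2: 100.0, 4: 100.0, 8: 180.0},  # 4-bit costs the same as 2-bit here
    {2: 50.0, 4: 70.0, 8: 90.0},     # latency grows with precision
    {2: 120.0, 4: 110.0, 8: 110.0},  # sub-byte is slower (e.g. unpacking overhead)
]


def total_latency(config: List[int], latency: List[Dict[int, float]]) -> float:
    """Sum the profiled latency of every layer at its assigned bit-width."""
    return sum(table[bits] for bits, table in zip(config, latency))


def free_upgrade(config: List[int], latency: List[Dict[int, float]]) -> List[int]:
    """Assumed heuristic: raise each layer to the highest precision that is
    not slower than the precision chosen by the hardware-agnostic search."""
    upgraded = []
    for bits, table in zip(config, latency):
        budget = table[bits]
        # Candidate precisions at least as high as the current one that
        # fit within the current layer's latency.
        candidates = [b for b, t in table.items() if b >= bits and t <= budget]
        upgraded.append(max(candidates))
    return upgraded


searched = [2, 4, 2]  # hypothetical output of the differentiable search
tuned = free_upgrade(searched, EXAMPLE_LATENCY)
print(tuned, total_latency(searched, EXAMPLE_LATENCY), total_latency(tuned, EXAMPLE_LATENCY))
```

With these invented numbers, two layers gain precision and total latency does not increase, which is the kind of hardware-specific effect that an operation-count proxy cannot capture.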
Jul-6-2023
- Country:
- Europe
- Italy > Emilia-Romagna
- Metropolitan City of Bologna > Bologna (0.04)
- Switzerland > Zürich
- Zürich (0.14)
- Genre:
- Research Report (0.40)
- Technology: