To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

Recent advances in deep neural networks (DNNs) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-constrained computing devices. Model compression techniques can reduce the computational demands of deep inference on embedded devices. These techniques are highly attractive because they do not rely on specialized hardware or on computation offloading, which is often infeasible due to privacy concerns or high latency. However, it remains unclear how model compression techniques perform across a wide range of DNNs. To design efficient embedded deep learning solutions, we need to understand their behaviors. This work develops a quantitative approach to characterize model compression techniques on a representative embedded deep learning architecture, the NVIDIA Jetson TX2. We perform extensive experiments covering 11 influential neural network architectures from the image classification and natural language processing domains. We experimentally show how two mainstream compression techniques, data quantization and pruning, perform on these network architectures, and the implications of compression for model storage size, inference time, energy consumption and performance metrics. We demonstrate that there are opportunities to achieve fast deep inference on embedded systems, but one must carefully choose the compression settings. Our results provide insights on when and how to apply model compression techniques, along with guidelines for designing efficient embedded deep learning systems.
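To make the pruning technique mentioned above concrete, here is a minimal sketch of magnitude-based weight pruning in plain Python. This is an illustration of the general idea only, not the implementation evaluated in the paper: weights whose absolute value falls below a percentile threshold are zeroed out, so the model can be stored in a sparse format.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights.

    `weights` is a flat list of floats; `sparsity` is in [0, 1].
    """
    k = int(len(weights) * sparsity)
    # Threshold is the k-th smallest absolute value (0.0 if nothing is pruned).
    threshold = sorted(abs(w) for w in weights)[k - 1] if k > 0 else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Toy weight vector standing in for one layer of a network.
weights = [0.8, -0.05, 0.3, -0.6, 0.02, -0.9, 0.1, 0.4]
pruned = prune_by_magnitude(weights, sparsity=0.5)
# The four smallest-magnitude weights are now zero; the survivors are unchanged.
```

The trade-off the paper characterizes follows directly from the `sparsity` knob: higher sparsity shrinks storage and can speed up inference, but zeroing too many weights degrades accuracy.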

How TensorFlow Lite Optimizes Neural Networks for Mobile Machine Learning


The steady rise of mobile Internet traffic has provoked a parallel increase in demand for on-device intelligence capabilities. However, the inherent scarcity of resources at the Edge means that satisfying this demand will require creative solutions to old problems. How do you run computationally expensive operations on a device that has limited processing capability without it turning into magma in your hand? The addition of TensorFlow Lite to the TensorFlow ecosystem provides us with the next step forward in machine learning capabilities, allowing us to harness the power of TensorFlow models on mobile and embedded devices while maintaining low latency, efficient runtimes, and accurate inference. TensorFlow Lite provides the framework for a trained TensorFlow model to be compressed and deployed to a mobile or embedded application.

Microsoft wants to bring AI to Raspberry Pi and other tiny devices (ZDNet)


Microsoft has released the Embedded Learning Library, offering developers a pre-trained image recognition model for Raspberry Pi and other developer boards. The early preview of the Embedded Learning Library (ELL), now available on GitHub, is part of Microsoft's effort to miniaturize its machine-learning software for a range of extremely low-powered chips on devices that aren't connected to the cloud. As the company explains in a blog post, a team at the Microsoft Research lab is working on compressing its machine learning models to work on the Cortex-M0, an ARM processor no bigger than a breadcrumb. The aim is to push machine learning to devices that aren't connected to the internet. Microsoft's new art feature for its Pix iPhone photo app already uses AI on the device, but the plan is to enable this on much less powerful chips, such as a brain implant, which might need to work without a network connection.

Introduction to TensorFlow Lite (TensorFlow)


TensorFlow Lite is TensorFlow's lightweight solution for mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size. TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API. TensorFlow Lite uses many techniques for achieving low latency, such as optimizing the kernels for mobile apps, pre-fused activations, and quantized kernels that allow smaller and faster (fixed-point math) models. Most of our TensorFlow Lite documentation is on GitHub for the time being.
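The "fixed-point math" behind quantized kernels can be illustrated with a minimal sketch of 8-bit affine quantization in plain Python. This is the general idea only, not TensorFlow Lite's actual kernel code: floats are mapped to unsigned 8-bit integers via a scale and a zero point, computed on in integer form, and mapped back when needed.

```python
def quantize(values, num_bits=8):
    """Map floats to [0, 2^num_bits - 1] integers via an affine mapping."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant inputs
    zero_point = round(qmin - lo / scale)     # the integer that represents 0.0
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, 0.0, 0.5, 2.0]
q, scale, zp = quantize(vals)
approx = dequantize(q, scale, zp)
# Each recovered value is within one quantization step of the original.
```

Storing 8-bit integers instead of 32-bit floats cuts model size by about 4x, and integer arithmetic is typically faster on mobile CPUs, which is exactly the latency benefit the paragraph above describes.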

Build AI that works offline with Coral Dev Board, Edge TPU, and TensorFlow Lite


These new devices are made by Coral, Google's new platform for enabling embedded developers to build amazing experiences with local AI. Coral's first products are powered by Google's Edge TPU chip, and are purpose-built to run TensorFlow Lite, TensorFlow's lightweight solution for mobile and embedded devices. As a developer, you can use Coral devices to explore and prototype new applications for on-device machine learning inference. Coral's Dev Board is a single-board Linux computer with a removable System-On-Module (SOM) hosting the Edge TPU. It allows you to prototype applications and then scale to production by including the SOM in your own devices.