AI's rapid evolution is producing an explosion in new types of hardware accelerators for machine learning and deep learning. Some people refer to this as a "Cambrian explosion," which is an apt metaphor for the current period of fervent innovation. From that point onward, these creatures--ourselves included--fanned out to occupy, exploit, and thoroughly transform every ecological niche on the planet. The range of innovative AI hardware-accelerator architectures continues to expand. Although you may think that graphic processing units (GPUs) are the dominant AI hardware architecture, that is far from the truth.
Google recently released TensorFlow Quantum, a toolset for combining state-of-the-art machine learning techniques with quantum algorithm design. This is an essential step to build tools for developers working on quantum applications. Simultaneously, they have focused on improving quantum computing hardware performance by integrating a set of quantum firmware techniques and building a TensorFlow-based toolset working from the hardware level up – from the bottom of the stack. The fundamental driver for this work is tackling the noise and error in quantum computers. Here's a small overview of the above and how the impact of noise and imperfections (critical challenges) is suppressed in quantum hardware.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA CUDA primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces. RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar DataFrame API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.
This is a pretty active area of research, namely "edge device computing" which often intertwines with "model compression". Using embedded devices that have GPUs such as the Nvidia Jetson TX2 is often a good place to start. This way you can use a smaller GPU that offers CUDA support in an embedded setting. However you must make sure your models are small enough to fit on a device with compute limitations. Frameworks like Tensorflow can train models on a GPU and then you can save the weights, then perform inference elsewhere on a CPU, perhaps you can do something like this on a raspberry pi but keep in mind you will be severly limited on such a device.
Machine learning is playing an increasingly significant role in emerging mobile application domains such as AR/VR, ADAS, etc. Accordingly, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, which involve multiple components in a mobile Systems-on-a-chip (SoC). Focusing on just ML accelerators loses bigger optimization opportunity at the system (SoC) level. This paper argues that hardware architects should expand the optimization scope to the entire SoC. We demonstrate one particular case-study in the domain of continuous computer vision where camera sensor, image signal processor (ISP), memory, and NN accelerator are synergistically co-designed to achieve optimal system-level efficiency.