AI's rapid evolution is producing an explosion of new types of hardware accelerators for machine learning and deep learning. Some people refer to this as a "Cambrian explosion," which is an apt metaphor for the current period of fervent innovation. The term refers to the period roughly 540 million years ago when most of the major animal body plans first appeared in the fossil record. From that point onward, these creatures, ourselves included, fanned out to occupy, exploit, and thoroughly transform every ecological niche on the planet. The range of innovative AI hardware-accelerator architectures continues to expand. Although you may think that graphics processing units (GPUs) are the dominant AI hardware architecture, that is far from the truth.
Nvidia's greatest growth in chips in 2017 was in the AI and cloud-based sectors, and that growth is expected to increase in 2018. This year tech companies will begin moving AI more to the "edge" of access, leveraging trained machine learning software with cloud-based computing, according to a VentureBeat.com article. The authors, Daniel Li, Principal, and S. Somasegar, Managing Director, predicted four new trends for 2018, among them: Machine learning models will operate outside of data centers, on phones and personal assistant devices such as Alexa and Siri, to reduce power and bandwidth consumption, reduce latency, and ensure privacy. Specialized chips for AI will perform better than all-purpose chips, and computers built to optimize AI are already being designed. Text, voice, gestures, and vision will all be used more widely to communicate with systems.
Machine learning is playing an increasingly significant role in emerging mobile application domains such as AR/VR and ADAS. Accordingly, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, which involve multiple components in a mobile System-on-a-chip (SoC). Focusing on ML accelerators alone misses a bigger optimization opportunity at the system (SoC) level. This paper argues that hardware architects should expand the optimization scope to the entire SoC. We demonstrate one particular case study in the domain of continuous computer vision, where the camera sensor, image signal processor (ISP), memory, and NN accelerator are synergistically co-designed to achieve optimal system-level efficiency.
At the start of last month I sat down to benchmark the new generation of accelerator hardware intended to speed up machine learning inferencing on the edge. So I'd have a rough yardstick for comparison, I also ran the same benchmarks on the Raspberry Pi. Afterwards a lot of people complained that I should have been using TensorFlow Lite on the Raspberry Pi rather than full-blown TensorFlow. They were right: it ran a lot faster. Then, with the release of the AI2GO framework from Xnor.ai, which uses next-generation binary weight models, I looked at the inferencing speeds of this next generation of models in comparison to 'traditional' TensorFlow.
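For anyone who wants to run this kind of comparison themselves, here is a minimal sketch of a timing harness in plain Python. It is not the code used for the benchmarks described above; `benchmark` and `dummy_infer` are hypothetical names, and `dummy_infer` is only a stand-in for whatever call runs a single inference on your own model. The warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark(infer, sample, warmup=3, runs=20):
    """Time repeated calls to infer(sample); return latency stats in milliseconds."""
    # Warm-up calls are discarded: the first inference often pays one-off
    # costs (model loading, cache fills) that would skew the numbers.
    for _ in range(warmup):
        infer(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": runs,
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_infer(pixels):
    # Hypothetical stand-in for a real model call; it only burns some CPU time.
    return sum(p * p for p in pixels)


if __name__ == "__main__":
    stats = benchmark(dummy_infer, list(range(10_000)))
    print(f"median {stats['median_ms']:.3f} ms over {stats['runs']} runs")
```

Reporting the median alongside the mean keeps a single slow run (for example, one interrupted by a background process) from dominating the result. Swapping in a real model only requires passing a different callable as `infer`.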
This is a pretty active area of research, namely "edge device computing", which often intertwines with "model compression". Using embedded devices that have GPUs, such as the Nvidia Jetson TX2, is often a good place to start. This way you can use a smaller GPU that offers CUDA support in an embedded setting. However, you must make sure your models are small enough to fit on a device with compute limitations. Frameworks like TensorFlow can train models on a GPU, and you can then save the weights and perform inference elsewhere on a CPU. Perhaps you can do something like this on a Raspberry Pi, but keep in mind you will be severely limited on such a device.
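To give a feel for what "model compression" can mean in practice, here is a rough sketch of one common technique, 8-bit weight quantization, written in plain Python so it runs anywhere. This is only an illustration, not how TensorFlow or any particular framework implements it: the function names and the example weights are made up, and real toolchains quantize per layer or per channel with more care.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: floats -> integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


def dot(weights, inputs):
    """Plain dot product, standing in for one neuron of a model."""
    return sum(w * x for w, x in zip(weights, inputs))


if __name__ == "__main__":
    weights = [0.82, -0.41, 0.05, -1.30, 0.67]  # made-up example values
    inputs = [1.0, 2.0, 3.0, 4.0, 5.0]

    q, scale = quantize_int8(weights)
    approx = dequantize(q, scale)

    print("int8 weights:", q)
    print("full precision output:", dot(weights, inputs))
    print("quantized output:", dot(approx, inputs))
```

Each weight stored this way needs one byte instead of the four used by a 32-bit float, at the cost of a rounding error of at most half the scale factor per weight. Whether that accuracy loss is acceptable depends on the model, so it is worth checking outputs before and after.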