Typically, deep learning approaches to voice recognition -- systems that employ layers of neuron-mimicking mathematical functions to parse human speech -- lean on powerful remote servers for the bulk of their processing. But researchers at the University of Waterloo and startup DarwinAI claim to have pioneered a strategy for designing speech recognition networks that not only achieves state-of-the-art accuracy, but also produces models compact enough to run on low-end smartphones. They describe their method in a paper published on the preprint server arXiv.org. It builds on work by Amazon's Alexa Machine Learning team, which earlier this year developed navigation, temperature control, and music playback algorithms that can run locally; Qualcomm, which in May claimed to have created on-device voice recognition models that are 95 percent accurate; Dublin, Ireland startup Voysis, which in September announced an offline WaveNet voice model for mobile devices; and Intel. "In this study, we explore a human-machine collaborative design strategy for building low-footprint [deep neural network] architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration," the researchers wrote.
Deep learning has been widely applied to many different problems, such as image classification, speech recognition, and natural language processing, and has demonstrated state-of-the-art results for these problems. Despite this promise, deep neural networks (DNNs) remain challenging to deploy in on-device edge scenarios such as mobile and other consumer devices. Due to the limited computational resources available in such on-device edge scenarios, many recent studies [4, 5, 6, 7] have put greater effort into designing small, low-footprint deep neural network architectures that are more appropriate for embedded devices. A particularly interesting approach for enabling low-footprint deep neural network architectures is the concept of knowledge distillation, where the performance of a smaller network is significantly improved by leveraging a teacher-student strategy in which the smaller network is trained to mimic the behaviour of a larger teacher network. While much of the research on distillation has focused on transferring knowledge from larger networks to smaller networks, little research has explored leveraging distillation to compress the knowledge encapsulated in the training data itself into a reduced form. By producing training data of reduced dimensionality, one can achieve input-efficient deep neural networks with significantly reduced computational costs. In this study, we explore a concept we call progressive label distillation, where a series of teacher-student network pairs are leveraged to progressively generate distilled training data.
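The teacher-student strategy at the core of the abstract above can be sketched as a distillation loss: the student is trained to match the teacher's temperature-softened output distribution. The following is a minimal NumPy sketch, not the paper's implementation; the function names and the temperature value are illustrative assumptions, and the paper's progressive variant would chain several such teacher-student pairs.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax: higher T yields softer distributions,
    # exposing more of the teacher's "dark knowledge" about non-target classes.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's softened outputs and the student's
    # softened outputs: the quantity minimized in teacher-student training.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()
```

When the student's logits match the teacher's exactly, the loss reduces to the entropy of the teacher's softened distribution, its minimum over all student outputs for that teacher.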
The tremendous potential exhibited by deep learning is often offset by architectural and computational complexity, making widespread deployment a challenge for edge scenarios such as mobile and other consumer devices. To tackle this challenge, we explore the following idea: Can we learn generative machines to automatically generate deep neural networks with efficient network architectures? In this study, we introduce the idea of generative synthesis, which is premised on the intricate interplay between a generator-inquisitor pair that work in tandem to garner insights and learn to generate highly efficient deep neural networks that best satisfy operational requirements. What is most interesting is that, once a generator has been learned through generative synthesis, it can be used to generate not just one but a large variety of distinct, highly efficient deep neural networks that satisfy operational requirements. Experimental results for image classification, semantic segmentation, and object detection tasks illustrate the efficacy of generative synthesis in producing generators that automatically generate highly efficient deep neural networks (which we nickname FermiNets) with higher model efficiency and lower computational costs (reaching >10x improvements in model efficiency and fewer multiply-accumulate operations than several tested state-of-the-art networks), as well as higher energy efficiency (reaching >4x improvements in image inferences per joule consumed on a Nvidia Tegra X2 mobile processor). As such, generative synthesis can be a powerful, generalized approach for accelerating and improving the building of deep neural networks for on-device edge scenarios.
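The efficiency gains above are stated in multiply-accumulate (MAC) operations, the standard cost unit for comparing convolutional networks. As a point of reference for how such counts are obtained, here is a small sketch; the helper function is illustrative, not from the paper, and uses the textbook MAC count for a standard 2D convolution.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    # MACs for a standard 2D convolution with a k x k kernel:
    # each of the h_out * w_out * c_out output elements requires
    # k * k * c_in multiply-accumulates.
    return h_out * w_out * c_out * (k * k * c_in)

# Example: a 3x3 convolution producing a 56x56 map with 64 output
# channels from 64 input channels.
macs = conv2d_macs(56, 56, 64, 64, 3)
```

Summing this quantity over all layers gives the per-inference MAC count that efficiency comparisons like the ">10x" figure above are based on.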
We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting, as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent success of dilated convolutions in sequence modeling applications, allowing the training of deeper architectures in resource-constrained configurations. Gated activations and residual connections are also added, following a configuration similar to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy and requiring only detection of the end of the keyword. Our experimental results show that our model outperforms a max-pooling loss trained recurrent neural network using LSTM cells, with a significant decrease in false rejection rate. The underlying dataset - "Hey Snips" utterances recorded by over 2.2K different speakers - has been made publicly available to establish an open reference for wake-word detection.
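The stateless alternative to recurrence described above rests on dilated causal convolutions: each layer looks back over an exponentially growing window of past frames without carrying any internal state. Below is a minimal NumPy sketch of a single 1-D dilated causal convolution, assuming a plain looped implementation for clarity; a real model would stack several such layers (with dilations 1, 2, 4, ...) plus the gated activations and residual connections the abstract mentions.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    # x: (T,) input sequence; w: (K,) filter taps.
    # Causal: the output at time t depends only on x[t], x[t-d],
    # x[t-2d], ..., so no future frames are needed (left zero-padding).
    K = len(w)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            # w[K-1] multiplies the current sample, earlier taps look back.
            y[t] += w[k] * xp[pad + t - dilation * (K - 1 - k)]
    return y
```

Stacking L such layers with kernel size K and dilations d_1, ..., d_L gives a receptive field of 1 + (K - 1) * (d_1 + ... + d_L) frames, which is how deep stateless models cover keyword-length context.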
Much of the focus in the design of deep neural networks has been on improving accuracy, leading to more powerful yet highly complex network architectures that are difficult to deploy in practical scenarios, particularly on edge devices such as mobile and other consumer devices, given their high computational and memory requirements. As a result, there has been recent interest in the design of quantitative metrics for evaluating deep neural networks that account for more than just model accuracy as the sole indicator of network performance. In this study, we continue the conversation towards universal metrics for evaluating the performance of deep neural networks for practical usage. In particular, we propose a new balanced metric called NetScore, which is designed specifically to provide a quantitative assessment of the balance between accuracy, computational complexity, and network architecture complexity of a deep neural network. In one of the largest comparative analyses of deep neural networks in the literature, the NetScore metric, the top-1 accuracy metric, and the popular information density metric were compared across a diverse set of 50 different deep convolutional neural networks for image classification on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset. The evaluation results across these three metrics for this diverse set of networks are presented in this study to act as a reference guide for practitioners in the field. The proposed NetScore metric, like the other tested metrics, is by no means perfect, but the hope is to push the conversation towards better universal metrics for evaluating deep neural networks for use in practical scenarios and to help guide practitioners in model design.
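A metric of the kind described above can be sketched as a log-scaled ratio that rewards accuracy while penalizing parameter count and MAC cost. The sketch below follows that general form; the specific exponents and units (accuracy in percent, parameters in millions, MACs in billions) are assumptions for illustration, and the paper itself should be consulted for NetScore's exact coefficients.

```python
import math

def balanced_netscore(accuracy_pct, params_millions, macs_billions,
                      alpha=2.0, beta=0.5, gamma=0.5):
    # A NetScore-style balanced metric: accuracy is weighted more heavily
    # (alpha > beta, gamma) so that efficiency cannot fully compensate for
    # poor accuracy; the 20*log10 scaling expresses the score in decibels.
    # Exponent values here are illustrative assumptions.
    return 20.0 * math.log10(
        accuracy_pct ** alpha
        / (params_millions ** beta * macs_billions ** gamma)
    )
```

Under this form, a network with the same accuracy but fewer parameters and MACs scores strictly higher, which is the balance between accuracy and complexity the abstract describes.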