Baidu, the Chinese company operating a search engine, a mobile browser, and other web services, is announcing today the launch of SwiftScribe, a web app meant to help people transcribe audio recordings more quickly, using, you guessed it, artificial intelligence. Baidu has spent the past few years honing its DeepSpeech speech recognition software. Last year the company introduced TalkType, an Android keyboard that, using DeepSpeech, puts speech input first and typing second, based on the idea that you can enter information more quickly by saying it than by pecking it out. Now Baidu is coming out with another DeepSpeech-enhanced app, one that could arguably find better footing in a professional setting. Amazon, Apple, Google, and Microsoft have all been working on speech recognition right alongside Baidu, but none of the four has come up with something aimed at longer-form transcription.
The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine that aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API, and we also provide pre-trained English models. Our latest release, v0.6, offers the highest-quality, most feature-packed model so far. In this overview, we'll show how DeepSpeech can transform your applications by enabling client-side, low-latency, privacy-preserving speech recognition.
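To make "a simple API" concrete, here is a minimal sketch of transcribing a WAV file with the DeepSpeech Python package. Everything specific to DeepSpeech here is an assumption on the reader's side: it presumes `pip install deepspeech`, a downloaded model file (the path below is hypothetical), and 16-bit mono 16 kHz audio; the `Model` constructor arguments have also changed between releases, so check the API docs for your version.

```python
# Sketch: feeding a WAV file to a pre-trained DeepSpeech model.
# The model path is hypothetical; constructor signatures vary by release.
import array
import wave

def read_wav(path):
    """Read a 16-bit mono WAV file into an array of int16 samples."""
    with wave.open(path, "rb") as w:
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
        return samples, w.getframerate()

def transcribe(model_path, wav_path):
    # imports deferred so the WAV helper above works without these packages
    import numpy as np
    import deepspeech
    model = deepspeech.Model(model_path)  # some releases also take a beam width
    samples, rate = read_wav(wav_path)    # DeepSpeech expects 16 kHz input
    audio = np.frombuffer(samples.tobytes(), dtype=np.int16)
    return model.stt(audio)               # the recognized text

# text = transcribe("deepspeech-0.6.0-models.pbmm", "recording.wav")
```

Because inference runs in-process, nothing leaves the machine, which is where the client-side and privacy-preserving claims come from.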
In the era of voice assistants, it was about time for a decent open source effort to show up. The kind folks at Mozilla implemented the Baidu DeepSpeech architecture and published the project on GitHub. Reportedly they achieve quite a low word error rate of 6.5%, which is close to the human level. Nope, humans are not 100% accurate! Instead, they have a 5.83% word error rate. DeepSpeech, unlike Alexa or the Google Assistant SDK, runs on-device without requiring any kind of fancy backend or internet connectivity.
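For readers unfamiliar with the metric: word error rate is the word-level edit distance (substitutions + insertions + deletions) between the recognizer's output and a reference transcript, divided by the number of reference words. A small self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over six reference words: WER = 1/6
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

So a 6.5% WER means roughly one word in fifteen is wrong, against about one in seventeen for the human transcribers in the benchmark.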
The performance of an end-to-end neural network on a given hardware platform is a function of its compute and memory signature, which, in turn, is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, and precision. Current benchmarking tools suffer from limitations: a) they are too granular, like DeepBench; b) they mandate a working implementation that is framework-specific, hardware-architecture-specific, or both; or c) they provide only high-level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework that effectively identifies the memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware-specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish the baselines needed to understand the performance of an end-to-end neural network topology and make high-level architectural decisions. Further, integration of NTP with frameworks like TensorFlow, PyTorch, and Intel OpenVINO allows for performance comparison along several vectors: a) comparison of different frameworks on a given hardware platform; b) comparison of different hardware using a given framework; and c) comparison across different heterogeneous hardware configurations for a given framework. These capabilities empower a researcher to effortlessly make the architectural decisions needed to achieve optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular speech recognition topology.
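To illustrate what an implementation-free compute and memory signature looks like in the simplest case, here is a sketch that derives the multiply-accumulate count and parameter/activation footprint of a fully connected layer purely from its topology parameters. This is only an illustration of the idea, not NTP's actual API; the function name and the returned fields are invented for this example.

```python
# Hypothetical helper: estimate a dense layer's compute/memory signature
# from topology parameters alone (no framework, no implementation needed).
def dense_signature(batch, in_features, out_features, bytes_per_elem=4):
    """MACs plus parameter and activation memory for a fully connected layer."""
    macs = batch * in_features * out_features            # multiply-accumulates
    weights = in_features * out_features + out_features  # W matrix and bias
    activations = batch * (in_features + out_features)   # input and output tensors
    return {
        "macs": macs,
        "weight_bytes": weights * bytes_per_elem,        # fp32 by default
        "activation_bytes": activations * bytes_per_elem,
    }

sig = dense_signature(batch=16, in_features=1024, out_features=2048)
```

Summing such per-layer estimates over a topology, and varying batch size or precision (`bytes_per_elem`), is what lets a profiler compare architectural choices before any kernel is written.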
What could you or I do without Google's legions of ace AI programmers and racks of neural network training hardware? Let's look at the ways we can make a natural language bot of our own. As you'll see, it's entirely doable. One of the first steps in engineering a solution is to break it down into smaller pieces. Any conversation consists of a back-and-forth between two people, or between a person and a chunk of silicon in our case.
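That back-and-forth structure can be sketched as a simple turn-taking loop: each user turn is paired with a bot turn. The `reply` function below is a placeholder for the real speech recognition and language understanding pieces; both function names are invented for this sketch.

```python
# Sketch of the alternating user/bot exchange described above.
def reply(utterance):
    """Hypothetical stand-in for the bot's understanding/response step."""
    if "hello" in utterance.lower():
        return "Hi there!"
    return "Tell me more."

def converse(user_turns):
    """Pair each user turn with a bot turn and return the transcript."""
    transcript = []
    for utterance in user_turns:
        transcript.append(("user", utterance))
        transcript.append(("bot", reply(utterance)))
    return transcript
```

Swapping the placeholder `reply` for a speech-to-text front end (such as DeepSpeech) plus a response generator is exactly the decomposition into smaller pieces suggested above.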