Collaborating Authors


DeepSpeech 0.6: Mozilla's Speech-to-Text Engine Gets Fast, Lean, and Ubiquitous – Mozilla Hacks - the Web developer blog


The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine which aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API. We also provide pre-trained English models. Our latest release, version 0.6, offers the highest quality, most feature-packed model so far. In this overview, we'll show how DeepSpeech can transform your applications by enabling client-side, low-latency, and privacy-preserving speech recognition capabilities.

Mozilla updates DeepSpeech with an English language model that runs 'faster than real time'


DeepSpeech, a suite of speech-to-text and text-to-speech engines maintained by Mozilla's Machine Learning Group, this morning received an update (to version 0.6) that incorporates one of the fastest open source speech recognition models to date. In a blog post, senior research engineer Reuben Morais lays out what's new and enhanced, as well as other spotlight features coming down the pipeline. The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google's TensorFlow machine learning framework that's optimized for compute-constrained mobile and embedded devices. It has reduced DeepSpeech's package size from 98MB to 3.7MB and its built-in English model size -- which has a 7.5% word error rate on a popular benchmark and which was trained on 5,516 hours of transcribed audio from WAMU (NPR), LibriSpeech, Fisher, Switchboard, and Mozilla's Common Voice English data sets -- from 188MB to 47MB. Plus, it has cut down DeepSpeech's memory consumption by 22 times and boosted its startup speed by over 500 times.
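To put the size figures quoted above in perspective, the reduction factors can be worked out directly. This is a minimal sketch that uses only the numbers from the excerpt; the variable and function names are illustrative and are not part of DeepSpeech.

```python
# Sizes in megabytes, as quoted in the excerpt above.
package_before_mb, package_after_mb = 98, 3.7
model_before_mb, model_after_mb = 188, 47


def shrink_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is compared with `before`."""
    return before / after


package_factor = shrink_factor(package_before_mb, package_after_mb)
model_factor = shrink_factor(model_before_mb, model_after_mb)

print(f"package: {package_factor:.1f}x smaller")  # package: 26.5x smaller
print(f"model: {model_factor:.1f}x smaller")      # model: 4.0x smaller
```

In other words, the package shrinks by roughly a factor of 26 and the English model by a factor of 4.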

NTP: A Neural Network Topology Profiler

The performance of an end-to-end neural network on a given hardware platform is a function of its compute and memory signature, which in turn is governed by a wide range of parameters such as topology size, primitives used, framework used, batching strategy, latency requirements, and precision. Current benchmarking tools suffer from limitations such as a) being too granular, like DeepBench [1], b) mandating a working implementation that is framework specific, hardware-architecture specific, or both, or c) providing only high-level benchmark metrics. In this paper, we present NTP (Neural Net Topology Profiler), a sophisticated benchmarking framework, to effectively identify the memory and compute signature of an end-to-end topology on multiple hardware architectures, without the need for an actual implementation. NTP is tightly integrated with hardware-specific benchmarking tools to enable exhaustive data collection and analysis. Using NTP, a deep learning researcher can quickly establish the baselines needed to understand the performance of an end-to-end neural network topology and make high-level architectural decisions. Further, integration of NTP with frameworks like TensorFlow, PyTorch, and Intel OpenVINO allows for performance comparison along several vectors, such as a) comparison of different frameworks on a given hardware, b) comparison of different hardware using a given framework, and c) comparison across different heterogeneous hardware configurations for a given framework. These capabilities empower a researcher to effortlessly make the architectural decisions needed for achieving optimized performance on any hardware platform. The paper documents the architectural approach of NTP and demonstrates the capabilities of the tool by benchmarking Mozilla DeepSpeech, a popular speech recognition topology.
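The idea of estimating a topology's compute and memory signature without a working implementation can be illustrated with a back-of-the-envelope calculation. The sketch below is not NTP and does not reflect how NTP works internally; it simply counts multiply-add operations and weight storage for a stack of fully connected layers. The `DenseLayer` and `profile` names, the layer sizes, and the 4-byte (float32) assumption are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DenseLayer:
    """A fully connected layer described only by its shape (hypothetical helper)."""
    inputs: int
    outputs: int


def profile(layers: List[DenseLayer], batch_size: int = 1,
            bytes_per_value: int = 4) -> Dict[str, float]:
    """Estimate FLOPs and weight memory from layer shapes alone.

    Assumptions: one multiply and one add per weight per sample,
    bias additions and activation functions are ignored in the FLOP
    count, and weights and biases are stored at `bytes_per_value` bytes.
    """
    flops = 0
    weight_bytes = 0
    for layer in layers:
        flops += 2 * layer.inputs * layer.outputs * batch_size
        weight_bytes += (layer.inputs * layer.outputs + layer.outputs) * bytes_per_value
    return {
        "flops": flops,
        "weight_bytes": weight_bytes,
        # Rough arithmetic intensity: operations per byte of weights read.
        "flops_per_byte": flops / weight_bytes,
    }


# Made-up layer sizes, for illustration only.
example = profile([DenseLayer(494, 2048), DenseLayer(2048, 2048)], batch_size=16)
print(example)
```

Raising `batch_size` increases the FLOP count while the weight footprint stays fixed, which is one simple way to see why batching strategy appears among the parameters that shape the compute and memory signature.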

Make A Natural Language Phone Bot Like Google's Duplex AI


What could you or I do without Google's legions of ace AI programmers and racks of neural network training hardware? Let's look at the ways we can make a natural language bot of our own. As you'll see, it's entirely doable. One of the first steps in engineering a solution is to break it down into smaller steps. Any conversation consists of a back-and-forth between two people, or a person and a chunk of silicon in our case.

DeepSpeech on Windows WSL – Foti Dim's


In the era of voice assistants, it was about time for a decent open source effort to show up. The kind folks at Mozilla implemented the Baidu DeepSpeech architecture and published the project on GitHub. Reportedly they achieve quite a low word error rate of 6.5%, which is close to the human level. Nope, humans are not 100% accurate! Instead they have a 5.83% word error rate. DeepSpeech, unlike the offerings of Alexa or Google Assistant SDK, runs on-device without requiring any kind of fancy backend or internet connectivity.
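For readers unfamiliar with the metric: word error rate is commonly computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the recognizer's output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition; it is not taken from the DeepSpeech code base, and the function name and example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a 25% word error rate.
print(word_error_rate("turn the lights on", "turn the light on"))  # 0.25
```

On this scale, the 6.5% figure quoted above corresponds to roughly one word-level error for every 15 reference words.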

Baidu launches SwiftScribe, an app that transcribes audio with AI


Baidu, the Chinese company operating a search engine, a mobile browser, and other web services, is announcing today the launch of SwiftScribe, a web app that's meant to help people transcribe audio recordings more quickly, using -- you guessed it! -- AI. Baidu in the past few years has been honing its DeepSpeech software for speech recognition. Last year, the company introduced TalkType, an Android keyboard that, using DeepSpeech, puts speech input first and typing second, based on the idea that you can enter information more quickly when you say it than when you peck. Now Baidu is coming out with another app enhanced with DeepSpeech, one that could arguably find better footing in a professional setting. Amazon, Apple, Google, and Microsoft have all been working on speech recognition right alongside Baidu, but none of those four has come up with something aimed at longer-form transcription.