The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine that aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API, and we also provide pre-trained English models. Our latest release, v0.6, offers the highest-quality, most feature-packed model so far. In this overview, we'll show how DeepSpeech can transform your applications by enabling client-side, low-latency, and privacy-preserving speech recognition.
OpenSeq2Seq's main goal is to let researchers explore various sequence-to-sequence models as effectively as possible. Its efficiency comes from full support for distributed and mixed-precision training. OpenSeq2Seq is built on TensorFlow and provides all the building blocks needed to train encoder-decoder models for neural machine translation, automatic speech recognition, speech synthesis, and language modeling. The speech-to-text workflow reuses parts of the Mozilla DeepSpeech project, and the beam search decoder with language-model re-scoring (in the decoders module) is based on Baidu's DeepSpeech.
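To illustrate roughly how language-model re-scoring fits into beam search decoding, here is a minimal, self-contained sketch. This is not OpenSeq2Seq's actual implementation: the toy vocabulary, the lm_score stub, and the alpha weight are all assumptions made purely for illustration.

```python
import math

# Toy vocabulary; real decoders work over a character or subword alphabet.
VOCAB = ["a", "b", "<eos>"]

def lm_score(seq):
    # Stub language model that mildly rewards alternating tokens.
    # A real decoder would query an n-gram or neural LM here.
    return 0.1 * sum(1 for x, y in zip(seq, seq[1:]) if x != y)

def beam_search(step_log_probs, beam_width=2, alpha=1.0):
    """Pick the token sequence maximizing acoustic score + alpha * LM score."""
    beams = [([], 0.0)]  # each beam: (token sequence, acoustic log-prob)
    for log_probs in step_log_probs:
        # Expand every surviving beam by every vocabulary token.
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in zip(VOCAB, log_probs)]
        # Re-score with the language model before pruning to the beam width.
        candidates.sort(key=lambda c: c[1] + alpha * lm_score(c[0]), reverse=True)
        beams = candidates[:beam_width]
    best_seq, _ = max(beams, key=lambda c: c[1] + alpha * lm_score(c[0]))
    return best_seq

# Two timesteps of (assumed) per-token log-probabilities from an acoustic model.
log = math.log
steps = [[log(0.6), log(0.3), log(0.1)],
         [log(0.4), log(0.5), log(0.1)]]
print(beam_search(steps, beam_width=2, alpha=1.0))  # -> ['a', 'b']
```

With alpha set to 0 the language model is ignored and decoding reduces to a plain acoustic beam search; raising alpha shifts weight toward sequences the LM considers likely, which is the role re-scoring plays in the DeepSpeech-style decoder.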
Artie, a startup developing a platform for mobile games on social media, today released a data set and tool for detecting demographic bias in voice apps. The Artie Bias Corpus (ABC), which consists of audio files along with their transcriptions, aims to diagnose and mitigate the impact of factors like age, gender, and accent in voice recognition systems. Speech recognition has come a long way since IBM's Shoebox machine and Worlds of Wonder's Julie doll. But despite progress made possible by AI, voice recognition systems today are at best imperfect -- and at worst discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born speakers. More recently, the Algorithmic Justice League's Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve word error rates of 35% for African American voices versus 19% for white voices.
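Word error rate (WER) is the metric behind comparisons like the 35%-versus-19% figure above. Here is a minimal sketch of how a tool might surface such per-group gaps; this is not the Artie Bias Corpus tooling, and the wer and group_wer helpers, along with the simple per-utterance averaging, are illustrative assumptions.

```python
from collections import defaultdict

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def group_wer(samples):
    """Average per-utterance WER for each demographic group.

    samples: iterable of (group, reference transcript, system hypothesis).
    Note: corpus-level WER (total errors / total reference words) is another
    common convention; a simple mean over utterances is used here for brevity.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for group, ref, hyp in samples:
        totals[group][0] += wer(ref, hyp)
        totals[group][1] += 1
    return {g: errs / n for g, (errs, n) in totals.items()}

samples = [
    ("group A", "hello world", "hello world"),
    ("group B", "hello world", "hello word"),
]
print(group_wer(samples))  # -> {'group A': 0.0, 'group B': 0.5}
```

A large gap between groups in such a report is exactly the kind of signal a bias-detection data set like the ABC is designed to expose.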
In the era of voice assistants, it was about time for a decent open source effort to show up. The kind folks at Mozilla implemented the Baidu DeepSpeech architecture and published the project on GitHub. Reportedly it achieves quite a low word error rate of 6.5%, which is close to the human level. Nope, humans are not 100% accurate! Instead, they average a 5.83% word error rate. DeepSpeech, unlike Alexa or the Google Assistant SDK, runs on-device without requiring any fancy backend or internet connectivity.