Speech-to-text applications have never been so plentiful, popular or powerful, with researchers' pursuit of ever-better automatic speech recognition (ASR) system performance bearing fruit thanks to huge advances in machine learning technologies and the increasing availability of large speech datasets. Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. However, a lack of transcribed audio data for the less widely spoken of the world's 7,000 languages and dialects makes it difficult to train robust speech recognition systems in this area. To help ASR development for such low-resource languages and dialects, Facebook AI researchers have open-sourced the new wav2vec 2.0 algorithm for self-supervised language learning. The paper Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations claims to "show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler." A Facebook AI tweet says the new algorithm can enable automatic speech recognition models with just 10 minutes of transcribed speech data.
Automatic speech recognition, or ASR, is a foundational part of not only assistants like Apple's Siri, but dictation software such as Nuance's Dragon and customer support platforms like Google's Contact Center AI. It's the thing that enables machines to parse utterances for key phrases and words and that allows them to distinguish people by their intonations and pitches. Perhaps it goes without saying that ASR is an intense area of study for Facebook, whose conversational tech is used to power Portal's speech recognition and who is broadening the use of AI to classify content on its platform. To this end, at the InterSpeech conference earlier this year the Menlo Park company detailed wave2vec, a novel machine learning algorithm that improves ASR accuracy by using raw, untranscribed audio as training data. Facebook claims it achieves state-of-the-art results on a popular benchmark while using two orders of magnitude less training data and that it demonstrates a 22% error reduction over the leading character-based speech recognition system, Deep Speech 2. Wav2vec was made available earlier this year as an extension to the open source modeling toolkit fairseq, and Facebook says it plans to use wav2vec to provide better audio data representations for keyword spotting and acoustic event detection.
The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine which aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API. We also provide pre-trained English models. Our latest release, version v0.6, offers the highest quality, most feature-packed model so far. In this overview, we'll show how DeepSpeech can transform your applications by enabling client-side, low-latency, and privacy-preserving speech recognition capabilities.
Facebook researchers have developed what they claim is the largest automatic speech recognition (ASR) model of its kind -- a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. In a paper published on the preprint server Arxiv.org, the coauthors say the system, which contains around a billion parameters, improves speech recognition performance up to 28.8% on one benchmark compared with baselines. Designing a single model to recognize speech in multiple languages is desirable for several reasons. It simplifies the backend production pipeline, for one thing, and studies have shown training multilingual models on similar languages can decrease overall word error rate (WER). Facebook's model -- a so-called joint sequence-to-sequence (Seq2Seq) model -- was trained while sharing the parameters from an encoder, decoder, and token set across all languages. The encoder maps input audio sequences to intermediate representations while the decoder maps the representations to output text, and the token set simplifies the process of working with many languages by sampling sentences at different frequencies.
The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data 1. These representations must capture important underlying structures from the raw input, e.g., intermediate concepts, features, or latent variables that are useful for the downstream task. While supervised learning using large annotated corpora can leverage useful representations, collecting large amounts of annotated examples is costly, time-consuming, and not always feasible. This is particularly problematic for a large variety of applications. In the speech domain, for instance, there are many low-resource languages, where the progress is dramatically slower than in high-resource languages such as English.