Researchers at Facebook AI recently introduced and open-sourced wav2vec 2.0, a new framework for self-supervised learning of representations from raw audio data. The company claims that this framework can enable automatic speech recognition models with just 10 minutes of transcribed speech data. Neural network models have gained much traction over the last few years due to their applications across various sectors. These models typically rely on vast quantities of labelled training data. However, labelled data is usually much harder to gather than unlabelled data.
Automatic speech recognition, or ASR, is a foundational part of not only assistants like Apple's Siri, but also dictation software such as Nuance's Dragon and customer support platforms like Google's Contact Center AI. It's the thing that enables machines to parse utterances for key phrases and words and that allows them to distinguish people by their intonations and pitches. Perhaps it goes without saying that ASR is an intense area of study for Facebook, whose conversational tech is used to power Portal's speech recognition and which is broadening the use of AI to classify content on its platform. To this end, at the InterSpeech conference earlier this year the Menlo Park company detailed wav2vec, a novel machine learning algorithm that improves ASR accuracy by using raw, untranscribed audio as training data. Facebook claims it achieves state-of-the-art results on a popular benchmark while using two orders of magnitude less training data and that it demonstrates a 22% error reduction over the leading character-based speech recognition system, Deep Speech 2. Wav2vec was made available earlier this year as an extension to the open source modeling toolkit fairseq, and Facebook says it plans to use wav2vec to provide better audio data representations for keyword spotting and acoustic event detection.
The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data [1]. These representations must capture important underlying structures from the raw input, e.g., intermediate concepts, features, or latent variables that are useful for the downstream task. While supervised learning on large annotated corpora can yield useful representations, collecting large amounts of annotated examples is costly, time-consuming, and not always feasible. This is particularly problematic for a large variety of applications. In the speech domain, for instance, there are many low-resource languages, where progress is dramatically slower than in high-resource languages such as English.
Facebook researchers have developed what they claim is the largest automatic speech recognition (ASR) model of its kind -- a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. In a paper published on the preprint server arXiv.org, the coauthors say the system, which contains around a billion parameters, improves speech recognition performance by up to 28.8% on one benchmark compared with baselines. Designing a single model to recognize speech in multiple languages is desirable for several reasons. It simplifies the backend production pipeline, for one thing, and studies have shown that training multilingual models on similar languages can decrease overall word error rate (WER). Facebook's model -- a so-called joint sequence-to-sequence (Seq2Seq) model -- was trained with the parameters of the encoder, decoder, and token set shared across all languages. The encoder maps input audio sequences to intermediate representations while the decoder maps the representations to output text, and the token set simplifies the process of working with many languages by sampling sentences at different frequencies.
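Word error rate, the metric mentioned above, is conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The Python sketch below illustrates that standard definition; it is not code from the Facebook paper. The function names and the example sentences are made up for illustration, and the relative_reduction helper assumes that a figure such as "28.8%" refers to a relative drop in error, which the article does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


# Illustrative sentences only: one substitution and one deletion
# against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this definition a WER can exceed 1.0 when the hypothesis contains many inserted words, which is why it is reported as a rate rather than as an accuracy.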
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The consensus needed across the different tasks naturally imposes meaningful constraints on the encoder, helping it to discover general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.
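The single-encoder, multiple-worker layout described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' architecture. The one-layer tanh encoder, the linear worker heads, the layer sizes, and the two regression targets (mean amplitude and energy of the frame) are all hypothetical choices made only to show how several self-supervised losses share one encoding.

```python
import math
import random


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def make_weights(rows, cols, rng):
    """Small random weight matrix; the range is an arbitrary choice."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(encoder_weights, frame):
    """Shared encoder: one linear layer followed by tanh (toy stand-in)."""
    return [math.tanh(v) for v in linear(encoder_weights, frame)]


# Hypothetical self-supervised targets. Each is computed from the raw
# frame itself, so no human labels are involved.
def mean_amplitude(frame):
    return sum(frame) / len(frame)


def energy(frame):
    return sum(x * x for x in frame) / len(frame)


def total_loss(encoder_weights, workers, frame):
    """Sum of squared errors over all workers, which share one encoding."""
    features = encode(encoder_weights, frame)
    loss = 0.0
    for head_weights, target_fn in workers:
        prediction = linear(head_weights, features)[0]
        loss += (prediction - target_fn(frame)) ** 2
    return loss


rng = random.Random(0)
frame = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for raw audio
encoder_weights = make_weights(4, 8, rng)           # 8 samples -> 4 features
workers = [
    (make_weights(1, 4, rng), mean_amplitude),
    (make_weights(1, 4, rng), energy),
]
loss = total_loss(encoder_weights, workers, frame)
```

Because every worker's error is added to the same objective, a gradient-based optimizer would update the shared encoder weights using signals from all tasks at once, which is the sense in which the tasks must reach a consensus.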