Researchers at Facebook AI recently introduced and open-sourced a new framework for self-supervised learning of representations from raw audio data known as wav2vec 2.0. The company claims that this framework can enable automatic speech recognition models with just 10 minutes of transcribed speech data. Neural network models have gained much traction over the last few years due to its applications across various sectors. The models work with the help of vast quantities of labelled training data. However, most of the time, it is challenging to gather labelled data than unlabelled data.
Speech-to-text applications have never been so plentiful, popular or powerful, with researchers' pursuit of ever-better automatic speech recognition (ASR) system performance bearing fruit thanks to huge advances in machine learning technologies and the increasing availability of large speech datasets. Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. However, a lack of transcribed audio data for the less widely spoken of the world's 7,000 languages and dialects makes it difficult to train robust speech recognition systems in this area. To help ASR development for such low-resource languages and dialects, Facebook AI researchers have open-sourced the new wav2vec 2.0 algorithm for self-supervised language learning. The paper Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations claims to "show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler." A Facebook AI tweet says the new algorithm can enable automatic speech recognition models with just 10 minutes of transcribed speech data.
We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner with no manual annotations in the form of target boundaries nor phonetic transcriptions. We compare the proposed approach to several unsupervised baselines using both TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both data sets. Furthermore, we experimented with expanding the training set with additional examples from the Librispeech corpus. We evaluated the resulting model on distributions and languages that were not seen during the training phase (English, Hebrew and German) and showed that utilizing additional untranscribed data is beneficial for model performance.
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints to the encoder, contributing to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.
Facebook researchers have developed what they claim is the largest automatic speech recognition (ASR) model of its kind -- a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. In a paper published on the preprint server Arxiv.org, the coauthors say the system, which contains around a billion parameters, improves speech recognition performance up to 28.8% on one benchmark compared with baselines. Designing a single model to recognize speech in multiple languages is desirable for several reasons. It simplifies the backend production pipeline, for one thing, and studies have shown training multilingual models on similar languages can decrease overall word error rate (WER). Facebook's model -- a so-called joint sequence-to-sequence (Seq2Seq) model -- was trained while sharing the parameters from an encoder, decoder, and token set across all languages. The encoder maps input audio sequences to intermediate representations while the decoder maps the representations to output text, and the token set simplifies the process of working with many languages by sampling sentences at different frequencies.