AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko

arXiv.org, Machine Learning

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both the audio and visual modalities or only one of them. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, removing the need for external speech recognition models to generate pseudo-labels. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.

Machine learning has enabled rapid advances in fields such as speech processing. However, speech processing requires large amounts of labeled data to work well (Radford et al., 2023; Zheng et al., 2022), which is hard to acquire for the thousands of languages spoken worldwide. Semi-supervised learning aims to mitigate this challenge by using unlabeled data to learn better representations and improve performance over training on labeled data alone. Real-world unlabeled data is often multi-modal, for example, videos containing synchronized audio and visual information. In this work, we investigate whether such multi-modal data can be used in a semi-supervised pipeline to improve performance. Multi-modal data has an additional benefit: the modalities can be complementary to each other and provide cross-modal supervision, which influences our algorithm design.

In this work, we study audio-visual speech as multi-modal data with synchronized audio and visual input sequences. Using only the audio or only the video, we can perform two kinds of speech recognition: automatic speech recognition (ASR) from the audio channel, or visual speech recognition (VSR, i.e., lip-reading) from the video channel. However, the two modalities require substantially different amounts of labeled data for training practical models. For example, with 30 hours of labeled data we can train an ASR model that reaches around 11% word error rate (WER), while training modern end-to-end VSR models on the same amount of data is challenging: the lowest WER we achieve in our experiments is 96%.
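To make the training recipe concrete, the sketch below illustrates one continuous pseudo-labeling step in PyTorch: the same audio-visual model transcribes an unlabeled clip in inference mode, and the resulting pseudo-transcript joins the ground-truth labels in a CTC-style objective. This is a minimal sketch under stated assumptions, not the paper's implementation; the model class (AVSRModel), the greedy decoder, the batch layout, and the CTC loss choice are all illustrative stand-ins.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AVSRModel(nn.Module):
    """Toy audio-visual encoder with a CTC head (stand-in for the real model)."""
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, vocab=32):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)  # index 0 is the CTC blank

    def forward(self, audio, video):
        # Fuse the synchronized streams by summation; zeroing out one stream
        # yields audio-only (ASR) or video-only (VSR) recognition.
        x = self.audio_proj(audio) + self.video_proj(video)
        x, _ = self.encoder(x)
        return self.head(x).log_softmax(-1)  # (batch, time, vocab)

@torch.no_grad()
def greedy_ctc_decode(log_probs):
    """Greedy CTC decoding: collapse repeats, drop blanks (token id 0)."""
    out = []
    for seq in log_probs.argmax(-1):
        prev, tokens = 0, []
        for t in seq.tolist():
            if t != prev and t != 0:
                tokens.append(t)
            prev = t
        out.append(torch.tensor(tokens, dtype=torch.long))
    return out

def ctc_loss(log_probs, targets):
    """CTC loss over a batch of variable-length target token lists."""
    batch, time, _ = log_probs.shape
    input_lengths = torch.full((batch,), time, dtype=torch.long)
    target_lengths = torch.tensor([len(t) for t in targets], dtype=torch.long)
    return F.ctc_loss(log_probs.transpose(0, 1), torch.cat(targets),
                      input_lengths, target_lengths, zero_infinity=True)

def av_cpl_step(model, optimizer, labeled, unlabeled):
    # 1) Pseudo-label generation: the *same* model, in inference mode,
    #    transcribes the unlabeled audio-visual clips. Because this happens
    #    during training, pseudo-labels are continuously regenerated as the
    #    model improves; no external ASR model is involved.
    model.eval()
    pseudo = greedy_ctc_decode(model(unlabeled["audio"], unlabeled["video"]))

    # 2) One gradient step on ground-truth and pseudo-labeled transcripts.
    model.train()
    loss = (ctc_loss(model(labeled["audio"], labeled["video"]), labeled["targets"])
            + ctc_loss(model(unlabeled["audio"], unlabeled["video"]), pseudo))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, continuous pseudo-labeling recipes typically warm up on the labeled data alone so that the first pseudo-labels come from a reasonable seed model; the sketch above shows only the steady-state step where supervised and pseudo-labeled losses are mixed.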
