av-hubert
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Liu, Rui, Yuan, Hongyu, Li, Haizhou
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to understand audio and visual signals simultaneously, making the GER approach challenging to apply to AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of ``listening and seeing again''. Specifically, we first use a powerful AVSR system to read the audio and visual signals and produce the N-best hypotheses, and then use a Q-Former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert it into compressed audio and video representations that the LLM can understand. Afterward, the compressed audio-visual representations and the N-best hypotheses together constitute a Cross-modal Prompt that guides the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits, utterance, and representation levels, to improve correction accuracy while enhancing the interpretability of the compressed audio and visual representations. Experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems, reducing the Word Error Rate (WER) by 24% relative to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.
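The two pieces the abstract names, a Q-Former-style compressor and a cross-modal prompt, are easier to see in code. Below is a minimal, hypothetical PyTorch sketch: learned queries cross-attend to frame-level audio or visual features to produce a fixed number of tokens in the LLM's embedding space, and those tokens are concatenated with the embedded N-best hypotheses to form the prompt. The dimensions, module names, and single cross-attention layer are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (assumed shapes/names; not the official AVGER implementation).
import torch
import torch.nn as nn

class QFormerCompressor(nn.Module):
    """Compress a variable-length feature sequence into `num_queries` tokens
    via learned queries cross-attending to the modality features."""
    def __init__(self, feat_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)   # map into the LLM's embedding space

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, feats, feats)   # (B, num_queries, feat_dim)
        return self.proj(compressed)                # (B, num_queries, llm_dim)

def build_cross_modal_prompt(audio_tokens, video_tokens, hyp_embeds):
    """Concatenate compressed audio/video tokens with embedded N-best hypotheses.
    hyp_embeds: (B, L, llm_dim) embeddings of the concatenated N-best text."""
    return torch.cat([audio_tokens, video_tokens, hyp_embeds], dim=1)

# Example: 2 utterances, 250 audio frames, 75 video frames, 40 hypothesis tokens.
audio_enc, video_enc = QFormerCompressor(), QFormerCompressor()
prompt = build_cross_modal_prompt(
    audio_enc(torch.randn(2, 250, 1024)),
    video_enc(torch.randn(2, 75, 1024)),
    torch.randn(2, 40, 4096),
)
print(prompt.shape)  # torch.Size([2, 104, 4096]) -> fed to the LLM as a soft prompt
```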
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Yeo, Jeong Hun, Han, Seunghee, Kim, Minsu, Ro, Yong Man
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished only by considering the context. In this paper, we propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize context modeling ability by harnessing the power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by a self-supervised visual speech model. Observing that input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features using visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained on 433 hours of data.
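The deduplication idea is simple to illustrate: consecutive frames that map to the same visual speech unit are merged, shrinking the sequence handed to the LLM. A minimal sketch follows, assuming frame features and per-frame unit IDs are already available; the merge-by-averaging rule is an assumption, not necessarily VSP-LLM's exact procedure.

```python
# Minimal deduplication sketch (assumed merge-by-averaging; not the official code).
import torch

def deduplicate(features, units):
    """features: (T, D) frame-level visual features.
    units: (T,) integer visual speech unit id per frame (e.g., a k-means cluster).
    Returns features averaged over runs of identical consecutive units."""
    merged, start = [], 0
    for t in range(1, len(units) + 1):
        if t == len(units) or units[t] != units[start]:
            merged.append(features[start:t].mean(dim=0))  # collapse the run
            start = t
    return torch.stack(merged)

feats = torch.randn(6, 4)
units = torch.tensor([5, 5, 5, 2, 2, 9])   # three runs -> three output vectors
print(deduplicate(feats, units).shape)     # torch.Size([3, 4])
```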
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Han, HyoJung, Anwar, Mohamed, Pino, Juan, Hsu, Wei-Ning, Carpuat, Marine, Shi, Bowen, Wang, Changhan
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is available only in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement
Chou, Ju-Chieh, Chien, Chung-Ming, Livescu, Karen
Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.
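The first stage, selecting a nearly clean training subset with a neural quality estimator, can be sketched as a simple filtering pass. The estimator call and threshold below are placeholders; the paper's actual estimator and cutoff may differ.

```python
# Hypothetical filtering sketch; `quality_estimator` stands in for a neural
# speech-quality predictor returning a MOS-like score.
import torch

def select_nearly_clean(waveforms, quality_estimator, threshold=4.0):
    """Keep utterances whose predicted quality score exceeds `threshold`."""
    kept = []
    for wav in waveforms:
        with torch.no_grad():
            score = quality_estimator(wav)   # assumed: returns a scalar score
        if float(score) >= threshold:
            kept.append(wav)
    return kept

# Toy usage with a stand-in estimator that scores by (negated) signal variance.
fake_estimator = lambda wav: 5.0 - wav.std()
subset = select_nearly_clean([torch.randn(16000) * s for s in (0.1, 1.0, 3.0)],
                             fake_estimator, threshold=4.0)
print(len(subset))  # number of utterances passing the quality threshold
```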
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Rouditchenko, Andrew, Collobert, Ronan, Likhomanenko, Tatiana
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision for learning representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, eliminating the need for external speech recognition models to generate pseudo-labels. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.

Machine learning has enabled rapid advances in fields such as speech processing. However, speech processing requires large amounts of labeled data to work well (Radford et al., 2023; Zheng et al., 2022), which is hard to acquire for the thousands of languages spoken worldwide. Semi-supervised learning aims to mitigate this challenge by using unlabeled data to learn better representations and improve performance on labeled data. Real-world unlabeled data is often multi-modal, for example, videos containing synchronized audio and visual information. In this work, we investigate whether such multi-modal data can be used in a semi-supervised pipeline to improve performance on labeled data. Multi-modal data has an additional benefit: modalities can be complementary to each other and provide cross-modal supervision, which informs our algorithm design. We study audio-visual speech as multi-modal data with synchronized audio and visual input sequences. Using only the audio or the video, we can perform two kinds of speech recognition: automatic speech recognition (ASR) from the audio channel, or visual speech recognition (VSR) from the video channel (lip-reading). However, these modalities require substantially different amounts of labeled data to train practical models. For example, with 30 hours of labeled data, we can train an ASR model that reaches around 11% word error rate (WER), while training modern end-to-end VSR models on the same amount of data is challenging: the lowest WER we achieve in our experiments is 96%.
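The recipe can be summarized as a training loop in which the same AVSR model being optimized also periodically regenerates pseudo-labels for the unlabeled videos. A schematic sketch follows; the model API, loss, decoding call, and regeneration schedule are placeholders rather than AV-CPL's exact settings.

```python
# Schematic continuous pseudo-labeling loop (placeholder model/loss/decoding API;
# not the AV-CPL release code).
import random

def train_av_cpl(model, labeled_iter, unlabeled_data, optimizer,
                 num_steps, regen_every=1000):
    """labeled_iter yields (audio, video, text); unlabeled_data maps
    utterance id -> (audio, video)."""
    pseudo_labels = {}
    for step in range(num_steps):
        # 1) Periodically regenerate pseudo-labels with the *current* model,
        #    so no external ASR model is needed.
        if step % regen_every == 0:
            for utt_id, (a, v) in unlabeled_data.items():
                pseudo_labels[utt_id] = model.transcribe(a, v)   # assumed greedy decode

        # 2) Supervised loss on labeled audio-visual data.
        audio, video, text = next(labeled_iter)
        loss = model.loss(audio, video, text)                    # assumed seq2seq/CTC loss

        # 3) Add the loss on an unlabeled clip with its current pseudo-label.
        utt_id = random.choice(list(unlabeled_data))
        a, v = unlabeled_data[utt_id]
        loss = loss + model.loss(a, v, pseudo_labels[utt_id])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```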
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Cheng, Xize, Jin, Tao, Li, Linjun, Lin, Wang, Duan, Xinyu, Zhao, Zhou
Speech recognition builds a bridge between multimedia streams (audio-only, visual-only, or audio-visual) and the corresponding text transcription. However, when training a model for a new domain, progress is often blocked by the lack of new-domain utterances, especially labeled visual utterances. To break through this restriction, we attempt to achieve zero-shot modality transfer by maintaining the multi-modality alignment in phoneme space learned with unlabeled multimedia utterances in the high-resource domain during pre-training \cite{shi2022learning}, and propose a training system, Open-modality Speech Recognition (\textbf{OpenSR}), that enables models trained on a single modality (e.g., audio-only) to be applied to more modalities (e.g., visual-only and audio-visual). Furthermore, we employ a cluster-based prompt tuning strategy to handle the domain shift in scenarios where the new-domain utterances contain only common words. We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few-, and full-shot), and achieves highly competitive zero-shot performance compared to existing few-shot and full-shot lip-reading methods. To the best of our knowledge, OpenSR achieves state-of-the-art word error rates on LRS2 of 2.7\% for audio-visual speech recognition and 25.0\% for lip-reading. The code and demo are available at https://github.com/Exgc/OpenSR.
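The modality-transfer idea can be sketched in a few lines: because pre-training keeps audio and visual features aligned in a shared phoneme-like space, a decoder fine-tuned only on audio features can later be paired with the visual encoder. The module names and the additive fusion below are illustrative placeholders, not OpenSR's actual components.

```python
# Illustrative sketch of zero-shot modality transfer via a shared feature space
# (placeholder modules; not the OpenSR implementation).
import torch.nn as nn

class OpenModalityASR(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, decoder):
        super().__init__()
        # Both encoders were pre-trained to map into the same phoneme-aligned space.
        self.audio_encoder = audio_encoder
        self.visual_encoder = visual_encoder
        self.decoder = decoder

    def forward(self, audio=None, video=None):
        # Fine-tuning may have used audio only; at inference any modality works
        # because the decoder only ever sees the shared representation space.
        if audio is not None and video is not None:
            feats = self.audio_encoder(audio) + self.visual_encoder(video)  # assumed fusion
        elif audio is not None:
            feats = self.audio_encoder(audio)
        else:
            feats = self.visual_encoder(video)
        return self.decoder(feats)
```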
Audio-Visual Speech Enhancement with Score-Based Generative Models
Richter, Julius, Frintrop, Simone, Gerkmann, Timo
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality and reduces generative artifacts such as phonetic confusions with respect to the audio-only equivalent. The latter is supported by the word error rate of a downstream automatic speech recognition model, which decreases noticeably, especially at low input signal-to-noise ratios.
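The conditioning described above, layer-wise encoder features aggregated and time-aligned to the score network's resolution, can be sketched as follows. The number of layers, the softmax-weighted sum, and the nearest-neighbor time alignment are assumptions for illustration, not the paper's exact conditioning module.

```python
# Sketch of learnable layer-wise aggregation + time alignment of AV features
# (assumed weighting/interpolation; not the paper's exact module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseAggregator(nn.Module):
    def __init__(self, num_layers=24, feat_dim=1024, cond_dim=256):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized
        self.proj = nn.Linear(feat_dim, cond_dim)

    def forward(self, layer_feats, target_len):
        # layer_feats: (num_layers, B, T, feat_dim) hidden states of the AV encoder.
        w = F.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * layer_feats).sum(dim=0)                  # (B, T, feat_dim)
        fused = self.proj(fused).transpose(1, 2)              # (B, cond_dim, T)
        # Align the feature rate to the score network's time resolution.
        aligned = F.interpolate(fused, size=target_len, mode="nearest")
        return aligned.transpose(1, 2)                        # (B, target_len, cond_dim)

agg = LayerwiseAggregator()
cond = agg(torch.randn(24, 2, 50, 1024), target_len=200)
print(cond.shape)  # torch.Size([2, 200, 256])
```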
Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English
Ren, Xiaoming, Li, Chao, Wang, Shenjian, Li, Biao
Considering the bimodal nature of human speech perception, lip and teeth movement plays a pivotal role in automatic speech recognition. Benefiting from correlated and noise-invariant visual information, audio-visual recognition systems are more robust across multiple scenarios. In previous work, audio-visual HuBERT (AV-HuBERT) appears to be the best practice for incorporating modality knowledge. This paper outlines a mixed methodology, named conformer-enhanced AV-HuBERT, that pushes the AV-HuBERT system's performance a step further. Compared with the baseline AV-HuBERT, our method achieves 7% and 16% relative WER reductions under clean and noisy conditions, respectively, in the one-phase evaluation on the English AVSR benchmark dataset LRS3. Furthermore, we establish a novel 1000-hour Mandarin AVSR dataset, CSTS. On top of the baseline AV-HuBERT, pre-training with this dataset lets us exceed the WeNet ASR system by 14% and 18% relative on MISP and CMLR, respectively. The proposed conformer-enhanced AV-HuBERT brings a further 7% CER reduction on MISP and 6% on CMLR, compared with the baseline AV-HuBERT system.
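For reference, a compact conformer block of the kind typically swapped in for the transformer blocks of AV-HuBERT looks like the sketch below (macaron feed-forward halves, self-attention, and a convolution module). The ordering follows the standard Conformer recipe; the hyperparameters are illustrative and not necessarily this paper's configuration.

```python
# Compact conformer block sketch (standard macaron FFN / MHSA / conv ordering;
# hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    def __init__(self, dim, kernel=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)                       # pointwise + GLU
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (B, T, dim)
        y = self.norm(x).transpose(1, 2)       # (B, dim, T)
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))
        return self.drop(y.transpose(1, 2))

class ConformerBlock(nn.Module):
    def __init__(self, dim=768, heads=12, ffn_mult=4, dropout=0.1):
        super().__init__()
        ffn = lambda: nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * ffn_mult), nn.SiLU(),
            nn.Dropout(dropout), nn.Linear(dim * ffn_mult, dim), nn.Dropout(dropout))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm, self.final_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, dropout=dropout)

    def forward(self, x):                      # x: (B, T, dim)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)

block = ConformerBlock()
print(block(torch.randn(2, 75, 768)).shape)    # torch.Size([2, 75, 768])
```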
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Lian, Jiachen, Baevski, Alexei, Hsu, Wei-Ning, Auli, Michael
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has proven successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under most settings.
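The target construction alluded to above can be illustrated with a data2vec-style sketch: an EMA copy of the shared encoder produces targets by averaging its top-K layer outputs on the unmasked input, and the student regresses those targets at masked positions. Details such as K, the EMA decay, the normalization, and the regression loss are assumptions, not the released AV-data2vec settings.

```python
# data2vec-style target construction sketch (assumed K/decay/loss; not the
# released AV-data2vec code).
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

def av_data2vec_loss(student, teacher, av_inputs, mask, top_k=8):
    """av_inputs: fused audio-visual features; mask: (B, T) boolean of masked frames.
    Both encoders are assumed to return all layer outputs as a list of (B, T, D)."""
    with torch.no_grad():
        layers = teacher(av_inputs, masked=False)            # unmasked input to teacher
        target = torch.stack(layers[-top_k:]).mean(dim=0)    # average top-K layers
        target = F.instance_norm(target.transpose(1, 2)).transpose(1, 2)
    pred = student(av_inputs, masked=True)[-1]                # student sees masked input
    return F.mse_loss(pred[mask], target[mask])               # regress masked positions
```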
Robust Self-Supervised Audio-Visual Speech Recognition
Shi, Bowen, Hsu, Wei-Ning, Mohamed, Abdelrahman
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by ~50% (28.0% vs. 14.1%) using less than 10% of the labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
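The noise condition used in this evaluation, babble mixed into the audio stream at a target SNR, is straightforward to reproduce; a minimal sketch follows (the SNR value and babble source are placeholders).

```python
# Minimal babble-noise mixing sketch (placeholder SNR / noise source).
import torch

def mix_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` at the given SNR (both 1-D waveform tensors)."""
    if len(noise) >= len(speech):
        noise = noise[: len(speech)]
    else:  # tile the noise to cover the utterance
        noise = noise.repeat(len(speech) // len(noise) + 1)[: len(speech)]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

noisy = mix_at_snr(torch.randn(16000), torch.randn(8000), snr_db=0.0)
print(noisy.shape)  # torch.Size([16000])
```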