visual speech recognition
Designing Practical Models for Isolated Word Visual Speech Recognition
Panagos, Iason Ioannis, Sfikas, Giorgos, Nikou, Christophoros
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. In order to accurately predict the spoken words, these architectures often rely on deep neural networks in order to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs which translates into increased resource demands in terms of hardware requirements and results in limited applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in more practical applications. In this work, we aim to alleviate this issue by developing architectures for VSR that have low hardware costs. Following the standard two-network design paradigm, where one network handles visual feature extraction and another one utilizes the extracted features to classify the entire sequence, we develop lightweight end-to-end architectures by first benchmarking e fficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone. We create several unified models with low resource requirements but strong recognition performance. Experiments on the largest public database for English words demonstrate the e ff ectiveness and practicality of our developed models. Code and trained models will be made publicly available.
A Experiment Details
CC BY and have been used extensively by the research communities. Fine-tuning Table A2 summarizes the hyperparameters used for ASR fine-tuning. By default, the one pre-trained with modality dropout is used. Table B4 shows how fine-tuning modality dropout configurations affect ASR performance. Next, we study the impact of hyperparameters when fine-tuning on unimodal data.
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
Ahn, Young Jin, Park, Jungwoo, Park, Sangha, Choi, Jonghyun, Kim, Kee-Eung
Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes-visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Yeo, Jeong Hun, Han, Seunghee, Kim, Minsu, Ro, Yong Man
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. In the translation dataset, the MuAViC benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements compared to the recent model trained with 433 hours of data.
JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
Sun, Chang, Yang, Hong, Qin, Bo
Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent limitations of conveying semantic information visually. To mitigate this challenge, this paper introduces an advanced knowledge distillation approach using a Joint-Embedding Predictive Architecture (JEPA), named JEP-KD, designed to more effectively utilize audio features during model training. Central to JEP-KD is the inclusion of a generative network within the embedding layer, which enhances the video encoder's capacity for semantic feature extraction and brings it into closer alignment with the audio features from a pre-trained ASR model's encoder. This approach aims to progressively reduce the performance gap between VSR and ASR. Moreover, a comprehensive multimodal, multistage training regimen for the JEP-KD framework is established, bolstering the robustness and efficacy of the training process. Experiment results demonstrate that JEP-KD significantly improves the performance of VSR models and demonstrates versatility across different VSR platforms, indicating its potential for broader application within other multimodal tasks.
The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, Xie, Lei
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first place in all three tracks we participate.
Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish
Gimeno-Gómez, David, Martínez-Hinarejos, Carlos-D.
Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIP-RTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Nemani, Praneeth, Krishna, G. Sai, Kundrapu, Supriya
Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Over the years, there has been a considerable amount of research in the field of VSR involving different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers the works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. This survey provides an in-depth analysis of speaker-independent VSR systems evolution from 1990 to 2023. It outlines the development of VSR systems over time and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, This comprehensive review provides insights into the current state-of-the-art speaker-independent VSR and highlights potential areas for future research.
Deep Learning-based Spatio Temporal Facial Feature Visual Speech Recognition
Kumar, Pangoth Santhosh, Akshay, Garika
In low-resource computing contexts, such as smartphones and other tiny devices, Both deep learning and machine learning are being used in a lot of identification systems. as authentication techniques. The transparent, contactless, and non-invasive nature of these face recognition technologies driven by AI has led to their meteoric rise in popularity in recent years. While they are mostly successful, there are still methods to get inside without permission by utilising things like pictures, masks, glasses, etc. In this research, we present an alternate authentication process that makes use of both facial recognition and the individual's distinctive temporal facial feature motions while they speak a password. Because the suggested methodology allows for a password to be specified in any language, it is not limited by language. The suggested model attained an accuracy of 96.1% when tested on the industry-standard MIRACL-VC1 dataset, demonstrating its efficacy as a reliable and powerful solution. In addition to being data-efficient, the suggested technique shows promising outcomes with as little as 10 positive video examples for training the model. The effectiveness of the network's training is further proved via comparisons with other combined facial recognition and lip reading models.