Goto

Collaborating Authors

 lipreading


Learning Speaker-Invariant Visual Features for Lipreading

Li, Yu, Xue, Feng, Li, Shujie, Zhang, Jinrui, Yang, Shuang, Guo, Dan, Hong, Richang

arXiv.org Artificial Intelligence

Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specific features. Additionally, we design a speaker recognition sub-task within the main lipreading pipeline to filter speaker-specific features, then further explicitly disentangle these personalized visual features from the backbone network via gradient reversal. Experimental results demonstrate that SIFLip significantly enhances generalization performance across multiple public datasets. Experimental results demonstrate that SIFLip significantly improves generalization performance across multiple public datasets, outperforming state-of-the-art methods.


Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Neural Information Processing Systems

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automat(cid:173) ic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not(cid:173) visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and inte(cid:173) grated their responses by means of independent opinion pooling - the Bayesian optimal method given conditional independence, which seems to hold for our data. This hybrid system had an er(cid:173) ror rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.


Surface Learning with Applications to Lipreading

Neural Information Processing Systems

Most connectionist research has focused on learning mappings from one space to another (eg. This paper introduces the more general task of learning constraint surfaces. It describes a simple but powerful architecture for learning and manipulating nonlinear surfaces from data. We demonstrate the technique on low dimensional synthetic surfaces and compare it to nearest neighbor approaches. We then show its utility in learning the space of lip images in a system for improving speech recognition by lip reading.


Beyond Lipreading: Visual Speech Recognition Looks You in the Eye

#artificialintelligence

Like the lipreading spies of yesteryear peering through their binoculars, almost all visual speech recognition VSR research these days focuses on mouth and lip motion. But a new study suggests that VSR models could perform even better if they used additional available visual information. The VSR field typically looks at the mouth region since it is believed that lip shape and motion contain almost all the information correlated with speech. This has made the information in other facial regions considered as weak by default. But a new paper from the Key Laboratory of Intelligent Information Processing of the Chinese Academy of Sciences and the University of Chinese Academy of Sciences proposes that information from extraoral facial regions can consistently benefit SOTA VSR model performance.


Lipper: Synthesizing Thy Speech using Multi-View Lipreading

Kumar, Yaman, Jain, Rohit, Salik, Khwaja Mohd., Shah, Rajiv Ratn, yin, Yifang, Zimmermann, Roger

arXiv.org Machine Learning

Lipreading has a lot of potential applications such as in the domain of surveillance and video conferencing. Despite this, most of the work in building lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, there are multiple problems associated with making lipreading a text-based classification task like its dependence on a particular language and vocabulary mapping. Thus, in this paper we propose a multi-view lipreading to audio system, namely Lipper, which models it as a regression task. The model takes silent videos as input and produces speech as the output. With multi-view silent videos, we observe an improvement over single-view speech reconstruction results. We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems in order to show the real-time nature of audio produced. We also perform a user study for the audios produced in order to understand the level of comprehensibility of audios produced using Lipper.



Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Wolff, Gregory J., Prasad, K. Venkatesh, Stork, David G., Hennecke, Marcus

Neural Information Processing Systems

Automated speech recognition is notoriously hard, and thus any predictive source of information and constraints that could be incorporated into a computer speech recognition system would be desirable. Humans, especially the hearing impaired, can utilize visual information - "speech reading" - for improved accuracy (Dodd & Campbell, 1987, Sanders & Goodrich, 1971). Speech reading can provide direct information about segments, phonemes, rate, speaker gender and identity, and subtle informationfor segmenting speech from background noise or multiple speakers (De Filippo & Sims, 1988, Green & Miller, 1985). Fundamental support for the use of visual information comes from the complementary natureof the visual and acoustic speech signals. Utterances that are difficult to distinguish acoustically are the easiest to distinguish.


Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Wolff, Gregory J., Prasad, K. Venkatesh, Stork, David G., Hennecke, Marcus

Neural Information Processing Systems

Automated speech recognition is notoriously hard, and thus any predictive source of information and constraints that could be incorporated into a computer speech recognition system would be desirable. Humans, especially the hearing impaired, can utilize visual information - "speech reading" - for improved accuracy (Dodd & Campbell, 1987, Sanders & Goodrich, 1971). Speech reading can provide direct information about segments, phonemes, rate, speaker gender and identity, and subtle information for segmenting speech from background noise or multiple speakers (De Filippo & Sims, 1988, Green & Miller, 1985). Fundamental support for the use of visual information comes from the complementary nature of the visual and acoustic speech signals. Utterances that are difficult to distinguish acoustically are the easiest to distinguish.



Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Wolff, Gregory J., Prasad, K. Venkatesh, Stork, David G., Hennecke, Marcus

Neural Information Processing Systems

Automated speech recognition is notoriously hard, and thus any predictive source of information and constraints that could be incorporated into a computer speech recognition system would be desirable. Humans, especially the hearing impaired, can utilize visual information - "speech reading" - for improved accuracy (Dodd & Campbell, 1987, Sanders & Goodrich, 1971). Speech reading can provide direct information about segments, phonemes, rate, speaker gender and identity, and subtle information for segmenting speech from background noise or multiple speakers (De Filippo & Sims, 1988, Green & Miller, 1985). Fundamental support for the use of visual information comes from the complementary nature of the visual and acoustic speech signals. Utterances that are difficult to distinguish acoustically are the easiest to distinguish.