lipreading
Learning Speaker-Invariant Visual Features for Lipreading
Li, Yu, Xue, Feng, Li, Shujie, Zhang, Jinrui, Yang, Shuang, Guo, Dan, Hong, Richang
Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specific features. Additionally, we design a speaker recognition sub-task within the main lipreading pipeline to filter speaker-specific features, then further explicitly disentangle these personalized visual features from the backbone network via gradient reversal. Experimental results demonstrate that SIFLip significantly improves generalization performance across multiple public datasets, outperforming state-of-the-art methods.
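The abstract's explicit disentanglement step rests on a standard technique: a speaker-ID classifier trained behind a gradient reversal layer, so that the backbone is pushed away from speaker-specific features. Below is a minimal PyTorch sketch of that idea; the class names, layer sizes, and the `lam` coefficient are illustrative assumptions, not SIFLip's actual architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lam on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for lam

class SpeakerAdversarialHead(nn.Module):
    """Hypothetical speaker-recognition sub-task behind a gradient reversal layer."""
    def __init__(self, feat_dim, num_speakers, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, visual_feats):
        reversed_feats = GradReverse.apply(visual_feats, self.lam)
        return self.classifier(reversed_feats)

# Usage: logits = head(backbone_features); loss = cross_entropy(logits, speaker_ids).
# Minimizing this loss trains the head to recognize speakers, while the reversed
# gradient pushes the backbone toward speaker-invariant visual features.
```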
Lipreading by neural networks: Visual preprocessing, learning, and sensory integration
We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling - the Bayesian optimal method given conditional independence, which seems to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
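Independent opinion pooling has a compact closed form: under conditional independence, the combined posterior is proportional to the product of the per-modality posteriors divided by the class prior. A minimal NumPy sketch follows; the probability values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def opinion_pool(p_video, p_audio, prior):
    """Combine posteriors from independent video and audio classifiers:
    p(c|v,a) is proportional to p(c|v) * p(c|a) / p(c), assuming the two
    modalities are conditionally independent given the class c."""
    joint = p_video * p_audio / prior
    return joint / joint.sum()  # renormalize over classes

# Five-utterance task with a uniform prior (illustrative numbers only).
prior = np.full(5, 0.2)
p_video = np.array([0.10, 0.40, 0.20, 0.20, 0.10])  # video TDNN output
p_audio = np.array([0.30, 0.30, 0.20, 0.10, 0.10])  # acoustic TDNN output
print(opinion_pool(p_video, p_audio, prior))        # pooled posterior
```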
Surface Learning with Applications to Lipreading
Most connectionist research has focused on learning mappings from one space to another (e.g., classification and regression). This paper introduces the more general task of learning constraint surfaces. It describes a simple but powerful architecture for learning and manipulating nonlinear surfaces from data. We demonstrate the technique on low dimensional synthetic surfaces and compare it to nearest neighbor approaches. We then show its utility in learning the space of lip images in a system for improving speech recognition by lip reading.
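The abstract compares learned surfaces against nearest-neighbor approaches. As a rough stand-in for the core operation - projecting a query point (e.g., a lip image) onto a surface known only through samples - here is a nearest-neighbor/local-PCA sketch; the function name and the `k` and `dim` parameters are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def project_to_surface(query, samples, k=10, dim=2):
    """Project a query onto a surface known only through sample points:
    take the k nearest samples, fit a local tangent plane via SVD/PCA,
    and project the query onto that plane (a local-linear stand-in for
    a learned nonlinear constraint surface)."""
    dists = np.linalg.norm(samples - query, axis=1)
    neighbors = samples[np.argsort(dists)[:k]]
    center = neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(neighbors - center, full_matrices=False)
    basis = vt[:dim]                     # local tangent-plane directions
    coords = (query - center) @ basis.T  # coordinates within the plane
    return center + coords @ basis

# Toy example: noisy samples of the unit circle (a 1-D surface in 2-D).
theta = np.random.uniform(0, 2 * np.pi, 500)
samples = np.c_[np.cos(theta), np.sin(theta)]
print(project_to_surface(np.array([1.5, 0.2]), samples, k=20, dim=1))
```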
Beyond Lipreading: Visual Speech Recognition Looks You in the Eye
Like the lipreading spies of yesteryear peering through their binoculars, almost all visual speech recognition (VSR) research these days focuses on mouth and lip motion. But a new study suggests that VSR models could perform even better if they used additional available visual information. The VSR field typically looks at the mouth region, since it is believed that lip shape and motion contain almost all the information correlated with speech; as a result, information from other facial regions has been treated as weak by default. But a new paper from the Key Laboratory of Intelligent Information Processing of the Chinese Academy of Sciences and the University of Chinese Academy of Sciences proposes that information from extraoral facial regions can consistently benefit SOTA VSR model performance.
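The article doesn't spell out an architecture, but the basic idea of combining mouth and extraoral information can be sketched as a two-stream encoder with feature fusion. Everything below (class names, the tiny encoders, fusion by concatenation) is a hypothetical illustration, not the paper's actual model.

```python
import torch
from torch import nn

def tiny_encoder(feat_dim):
    """Stand-in frame encoder; a real VSR model would use a 3D-CNN/ResNet front-end."""
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, feat_dim),
    )

class TwoStreamVSRFrontend(nn.Module):
    """Hypothetical two-stream front-end: one stream for the mouth crop,
    one for the rest of the face, fused by concatenation and projection."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mouth_enc = tiny_encoder(feat_dim)
        self.face_enc = tiny_encoder(feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, mouth, face):  # each: (batch, 1, H, W) grayscale crops
        fused = torch.cat([self.mouth_enc(mouth), self.face_enc(face)], dim=-1)
        return self.fuse(fused)      # per-frame feature for a VSR back-end
```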
Lipper: Synthesizing Thy Speech using Multi-View Lipreading
Kumar, Yaman, Jain, Rohit, Salik, Khwaja Mohd., Shah, Rajiv Ratn, Yin, Yifang, Zimmermann, Roger
Lipreading has a lot of potential applications such as in the domain of surveillance and video conferencing. Despite this, most of the work in building lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, there are multiple problems associated with making lipreading a text-based classification task, such as its dependence on a particular language and vocabulary mapping. Thus, in this paper we propose a multi-view lipreading-to-audio system, namely Lipper, which models it as a regression task. The model takes silent videos as input and produces speech as the output. With multi-view silent videos, we observe an improvement over single-view speech reconstruction results. We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems in order to show the real-time nature of the audio produced. We also perform a user study on the produced audio to assess how comprehensible Lipper's speech reconstructions are.
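Framing lipreading as regression rather than classification means the model emits continuous audio features instead of class logits. A hypothetical PyTorch sketch of a multi-view video-to-audio-feature regressor follows; the linear encoder, the GRU, the mel-frame targets, and all dimensions are illustrative assumptions rather than Lipper's actual design.

```python
import torch
from torch import nn

class MultiViewSpeechRegressor(nn.Module):
    """Hypothetical multi-view lipreading-to-audio model: each camera view
    is encoded per frame, view features are concatenated, and a recurrent
    layer regresses one audio-feature frame per video frame (mel frames
    here stand in for whatever audio features a real system predicts)."""
    def __init__(self, n_views=3, frame_dim=64 * 64, hid=256, n_mels=80):
        super().__init__()
        self.encoder = nn.Linear(frame_dim, hid)           # per-view frame encoder
        self.rnn = nn.GRU(n_views * hid, hid, batch_first=True)
        self.head = nn.Linear(hid, n_mels)                 # regression, not classes

    def forward(self, views):  # views: (batch, time, n_views, frame_dim)
        b, t, v, _ = views.shape
        enc = torch.relu(self.encoder(views))              # (b, t, v, hid)
        enc = enc.reshape(b, t, -1)                        # concatenate views
        out, _ = self.rnn(enc)
        return self.head(out)                              # (b, t, n_mels)

# Usage: mels = MultiViewSpeechRegressor()(torch.randn(2, 75, 3, 64 * 64))
# Training would minimize a regression loss (e.g., MSE) against target frames.
```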
Lipreading by neural networks: Visual preprocessing, learning, and sensory integration
Wolff, Gregory J., Prasad, K. Venkatesh, Stork, David G., Hennecke, Marcus
Automated speech recognition is notoriously hard, and thus any predictive source of information and constraints that could be incorporated into a computer speech recognition system would be desirable. Humans, especially the hearing impaired, can utilize visual information - "speech reading" - for improved accuracy (Dodd & Campbell, 1987, Sanders & Goodrich, 1971). Speech reading can provide direct information about segments, phonemes, rate, speaker gender and identity, and subtle information for segmenting speech from background noise or multiple speakers (De Filippo & Sims, 1988, Green & Miller, 1985). Fundamental support for the use of visual information comes from the complementary nature of the visual and acoustic speech signals. Utterances that are difficult to distinguish acoustically are the easiest to distinguish visually, and vice versa.