Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
Zhao, Yuan, Liu, Rui, Cong, Gaoxiang
Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) multiscale prosody expression attributes in the context influence the current sentence's prosody; 2) prosody cues in the context interact with the current sentence, affecting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
Continuous Speech Recognition using EEG and Video
Krishna, Gautam, Carnahan, Mason, Tran, Co, Tewfik, Ahmed H
In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition. Our results demonstrate that EEG features are helpful in enhancing the performance of continuous visual speech recognition systems. In recent years there has been a lot of interesting work done in the fields of lip reading and audio-visual speech recognition. In [1] the authors demonstrated end-to-end sentence-level lip reading, and in [2] the authors demonstrated deep-learning-based end-to-end audio-visual speech recognition.
- North America > United States > Texas > Travis County > Austin (0.15)
- North America > United States > Texas > Mason County > Mason (0.04)
- Health & Medicine > Therapeutic Area (0.69)
- Health & Medicine > Diagnostic Medicine (0.47)