Lip to Speech Synthesis with Visual Context Attentional GAN

Neural Information Processing Systems

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes speech from local lip visual features by finding a viseme-to-phoneme mapping, while global visual context is embedded into the intermediate layers of the generator to resolve the ambiguity in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed that encodes global representations from the local visual features and, through audio-visual attention, provides the generator with the global visual context corresponding to a given coarse speech representation. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize speech in sync with the input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms existing state-of-the-art methods and can effectively synthesize speech from multiple speakers, a setting barely handled in previous works.
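To make the idea concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract describes: an audio-visual attention step in which coarse speech representations query global visual context, and an InfoNCE-style contrastive synchronization loss. Layer choices, dimensions, and the exact loss variant are illustrative assumptions, not the authors' released VCA-GAN implementation.

```python
import torch
import torch.nn as nn

class VisualContextAttention(nn.Module):
    """Sketch of audio-visual attention: coarse speech features attend
    over global visual features. Dimensions are assumed, not VCA-GAN's."""
    def __init__(self, speech_dim=256, visual_dim=512, attn_dim=256):
        super().__init__()
        self.query = nn.Linear(speech_dim, attn_dim)  # from coarse speech
        self.key = nn.Linear(visual_dim, attn_dim)    # from global visual context
        self.value = nn.Linear(visual_dim, attn_dim)
        self.out = nn.Linear(attn_dim, speech_dim)

    def forward(self, speech_feats, visual_feats):
        # speech_feats: (B, T_a, speech_dim); visual_feats: (B, T_v, visual_dim)
        q = self.query(speech_feats)
        k = self.key(visual_feats)
        v = self.value(visual_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        context = attn @ v                       # (B, T_a, attn_dim)
        # Inject the attended visual context back into the speech stream.
        return speech_feats + self.out(context)

def sync_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE-style synchronization loss: matched audio/visual clips are
    positives, other pairs in the batch are negatives (an assumed variant
    of the synchronization learning described in the abstract)."""
    audio_emb = nn.functional.normalize(audio_emb, dim=-1)
    visual_emb = nn.functional.normalize(visual_emb, dim=-1)
    logits = audio_emb @ visual_emb.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return nn.functional.cross_entropy(logits, targets)
```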


SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Wang, Kaidi, He, Yi, Guan, Wenhao, Wu, Weijie, Ding, Hongwu, Zhang, Xiong, Wu, Di, Meng, Meng, Luan, Jian, Li, Lin, Hong, Qingyang

arXiv.org Artificial Intelligence

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limited speech naturalness and audio-visual synchronization, and are restricted to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential for video dubbing tasks.
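The abstract does not detail the Dual Speaker Encoder, but one plausible reading is two speaker branches whose embeddings are fused by a learned gate, for example a language-agnostic timbre branch and a language-specific style branch. The sketch below is purely a hypothetical PyTorch illustration of that idea; the module names, dimensions, and gating scheme are assumptions, not SyncVoice's published architecture.

```python
import torch
import torch.nn as nn

class DualSpeakerEncoder(nn.Module):
    """Hypothetical dual speaker encoder: a timbre branch and a style
    branch over a reference mel-spectrogram, fused by a learned gate."""
    def __init__(self, mel_dim=80, emb_dim=256):
        super().__init__()
        self.timbre = nn.GRU(mel_dim, emb_dim, batch_first=True)
        self.style = nn.GRU(mel_dim, emb_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * emb_dim, emb_dim), nn.Sigmoid())

    def forward(self, ref_mel):
        # ref_mel: (B, T, mel_dim) reference utterance of the target speaker
        _, t = self.timbre(ref_mel)          # final hidden state: (1, B, emb_dim)
        _, s = self.style(ref_mel)
        t, s = t.squeeze(0), s.squeeze(0)
        g = self.gate(torch.cat([t, s], dim=-1))
        return g * t + (1 - g) * s           # fused speaker embedding (B, emb_dim)
```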



Language Without Borders: A Dataset and Benchmark for Code-Switching Lip Reading

Neural Information Processing Systems

Lip reading aims to transform videos of continuous lip movement into text, and has achieved significant progress over the past decade. It serves as a critical and practical aid for speech-impaired individuals, and is more practical than speech recognition in noisy environments.




A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

Amir, Javeria, Attaria, Farwa, Jabeen, Mah, Noor, Umara, Rashid, Zahid

arXiv.org Artificial Intelligence

Recent developments in voice cloning and talking-head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets through computationally intensive processes using clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline comprising Tortoise TTS, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few training samples, and Wav2Lip, a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution contributes to several essential goals: less reliance on massive pretraining, generation of emotionally expressive speech, and lip sync in noisy and unconstrained scenarios. In addition, the modular structure of the pipeline allows easy extension to future multimodal and text-guided voice modulation, and it could be used in real-world systems. Our experimental results show that the proposed system produces competitive sound quality and lip sync at a much smaller computational cost, indicating the possibility of deploying it in resource-constrained scenarios.

Keywords: Zero-Shot Voice Cloning, Latent Diffusion Models, Real-Time Lip Synchronization, GAN-Based Talking-Head Generation, Low-Resource Speech Synthesis, Emotionally Expressive Speech

1. Introduction

Voice cloning and talking-head generation systems have made tremendous progress in the past few years, benefiting from the development of deep generative models. These systems can be employed for virtual assistants, entertainment, telepresence, and assistive communication, making human-computer interaction more realistic and personalized based on interactive audio-visual context. Despite these advancements, state-of-the-art solutions rely heavily on big data and sophisticated computational resources, and therefore may not be practical for real-world low-resource or noisy settings.
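Because the pipeline is modular, gluing the two stages together is straightforward. The sketch below follows the public tortoise-tts Python API (TextToSpeech, tts_with_preset, load_audio) and Wav2Lip's inference.py command-line interface; exact signatures, sample rates, and checkpoint paths should be verified against the versions you install, and the file names here are placeholders.

```python
import subprocess
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

def clone_voice(text, sample_paths, out_wav="cloned.wav"):
    """Zero-shot voice cloning with Tortoise TTS, conditioned on a few
    short clips of the target speaker."""
    tts = TextToSpeech()
    voice_samples = [load_audio(p, 22050) for p in sample_paths]
    gen = tts.tts_with_preset(text, voice_samples=voice_samples, preset="fast")
    torchaudio.save(out_wav, gen.squeeze(0).cpu(), 24000)  # Tortoise outputs 24 kHz
    return out_wav

def lip_sync(face_video, audio_wav, checkpoint="checkpoints/wav2lip_gan.pth"):
    """Wav2Lip ships as a CLI; shell out to its inference script, run
    from the Wav2Lip repository root."""
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint_path", checkpoint,
         "--face", face_video,
         "--audio", audio_wav],
        check=True,
    )

# Example usage: clone the voice from reference clips, then dub the video.
audio = clone_voice("Hello from a cloned voice.", ["sample1.wav", "sample2.wav"])
lip_sync("speaker.mp4", audio)
```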