A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

Javeria Amir, Farwa Attaria, Mah Jabeen, Umara Noor, Zahid Rashid

arXiv.org Artificial Intelligence 

Corresponding Author: Umara Noor

Abstract

Recent developments in voice cloning and talking-head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. However, current methods are typically trained on large-scale datasets through computationally intensive processes and assume clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a modular pipeline comprising Tortoise text-to-speech (TTS), a transformer-based latent diffusion model that performs high-fidelity zero-shot voice cloning from only a few reference samples, and Wav2Lip, a lightweight generative adversarial network (GAN) architecture for robust real-time lip synchronization. The pipeline addresses several essential goals: reduced reliance on massive pretraining, generation of emotionally expressive speech, and reliable lip sync in noisy, unconstrained scenarios. In addition, the modular structure of the pipeline allows easy extension toward future multimodal and text-guided voice modulation, making it suitable for real-world systems. Our experimental results show that the proposed system produces competitive audio quality and lip synchronization at a much lower computational cost, indicating its potential for deployment in resource-constrained scenarios.

Keywords

Zero-Shot Voice Cloning, Latent Diffusion Models, Real-Time Lip Synchronization, GAN-Based Talking-Head Generation, Low-Resource Speech Synthesis, Emotionally Expressive Speech

1. Introduction

Voice cloning and talking-head generation systems have made tremendous progress in the past few years, benefiting from advances in deep generative models. These systems can be employed in virtual assistants, entertainment, telepresence, and assistive communication, making human-computer interaction more realistic and personalized in interactive, audio-visual contexts.
Despite these advancements, state-of-the-art solutions rely heavily on large datasets and substantial computational resources, and are therefore often impractical in real-world low-resource or noisy settings.
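The two-stage design described above (zero-shot voice cloning followed by lip synchronization) can be sketched as a simple orchestration layer. The functions `clone_voice` and `sync_lips` below are hypothetical placeholders standing in for Tortoise TTS inference and the Wav2Lip generator, respectively; their names and signatures are not from the paper.

```python
# Minimal sketch of the modular pipeline, assuming two hypothetical
# stage wrappers (clone_voice for Tortoise TTS, sync_lips for Wav2Lip).
from dataclasses import dataclass


@dataclass
class PipelineResult:
    """Paths to the artifacts produced by each pipeline stage."""
    audio_path: str
    video_path: str


def clone_voice(text: str, reference_clips: list[str], out_path: str) -> str:
    # Placeholder: a real implementation would run Tortoise TTS,
    # conditioning its latent diffusion decoder on a few reference
    # clips to achieve zero-shot voice cloning.
    assert reference_clips, "zero-shot cloning still needs a few reference samples"
    return out_path


def sync_lips(face_video: str, audio_path: str, out_path: str) -> str:
    # Placeholder: a real implementation would run the Wav2Lip GAN
    # generator to re-render the mouth region in sync with the audio.
    return out_path


def run_pipeline(text: str, reference_clips: list[str], face_video: str) -> PipelineResult:
    # Stage 1: synthesize speech in the target voice.
    audio = clone_voice(text, reference_clips, "cloned.wav")
    # Stage 2: re-render the face video so lip motion matches the audio.
    video = sync_lips(face_video, audio, "synced.mp4")
    return PipelineResult(audio_path=audio, video_path=video)
```

Because the stages communicate only through file paths, either component can be swapped out (e.g. for a different TTS model) without touching the other, which is the extensibility property the modular design aims for.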
