SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong
arXiv.org Artificial Intelligence
Video dubbing aims to generate high-fidelity speech that is precisely aligned in time with the visual content. Existing methods still suffer from limited speech naturalness and audio-visual synchronization, and are restricted to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We further propose a Dual Speaker Encoder that effectively mitigates inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing to video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential for video dubbing tasks.
Dec-8-2025
- Country:
- Asia > China > Fujian Province > Xiamen (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech
- Speech Recognition (0.47)
- Speech Synthesis (0.56)
- Vision (1.00)