SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong
arXiv.org Artificial Intelligence
Video dubbing aims to generate high-fidelity speech that is precisely aligned in time with the visual content. Existing methods still suffer from limited speech naturalness and audio-visual synchronization, and are restricted to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We further propose a Dual Speaker Encoder that effectively mitigates inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing to video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential for video dubbing tasks.
Dec-8-2025
- Country:
- Asia > China > Fujian Province > Xiamen (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech
- Speech Recognition (0.47)
- Speech Synthesis (0.56)
- Vision (1.00)