Length Aware Speech Translation for Video Dubbing
Harveen Singh Chadha, Aswin Shanmugam Subramanian, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li
arXiv.org Artificial Intelligence
In video dubbing, aligning translated audio with the source audio is a significant challenge. We focus on doing this efficiently, targeting real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving a mean opinion score (MOS) gain of 0.34 for Spanish and 0.65 for Korean.
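The length-tag idea can be illustrated with a minimal sketch of how training targets might be labeled. It is not the authors' implementation: the tag strings, the function names `length_tag` and `tag_target`, and the ratio thresholds 0.9 and 1.1 are all assumptions chosen for illustration, since the abstract states only that predefined tags select short, normal, or long output from a phoneme-based model.

```python
from typing import List

# Hypothetical tag tokens; the abstract does not give the actual tag strings.
SHORT_TAG, NORMAL_TAG, LONG_TAG = "<short>", "<normal>", "<long>"


def length_tag(src_phonemes: List[str], tgt_phonemes: List[str],
               low: float = 0.9, high: float = 1.1) -> str:
    """Pick a length tag from the target/source phoneme-count ratio.

    The thresholds `low` and `high` are illustrative assumptions,
    not values taken from the paper.
    """
    if not src_phonemes:
        raise ValueError("source phoneme sequence must be non-empty")
    ratio = len(tgt_phonemes) / len(src_phonemes)
    if ratio < low:
        return SHORT_TAG
    if ratio > high:
        return LONG_TAG
    return NORMAL_TAG


def tag_target(src_phonemes: List[str], tgt_phonemes: List[str]) -> List[str]:
    """Prepend the length tag to the target sequence for training."""
    return [length_tag(src_phonemes, tgt_phonemes)] + list(tgt_phonemes)
```

Under this sketch, a model trained on tagged targets could be steered at inference time by forcing the first decoded token to the desired tag. How LABS produces all three lengths in a single decoding pass is not described in the abstract and is not modeled here.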
Jun-3-2025