Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation

Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, Anthony Ferritto

arXiv.org Artificial Intelligence 

Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony, i.e., ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses that gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, with no degradation in translation quality.
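To make the idea of adding a lip-synchrony term to the training objective concrete, the sketch below shows one plausible way to combine a translation loss with an auxiliary audio-visual synchrony loss. This is a minimal illustration, not the authors' implementation: the SyncNet-style encoders, feature dimensions, and the weight `lambda_sync` are all assumptions introduced here for clarity.

```python
# Minimal sketch (assumed, not the paper's code): joint objective that adds a
# weighted lip-synchrony term to a standard translation loss during AVS2S training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LipSyncLoss(nn.Module):
    """Penalizes distance between audio and lip-region embeddings
    (a SyncNet-like criterion, assumed here for illustration)."""

    def __init__(self, audio_dim=80, video_dim=512, embed_dim=256):
        super().__init__()
        # Placeholder encoders; a real system would use pretrained audio/visual towers.
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.video_enc = nn.Linear(video_dim, embed_dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, frames, audio_dim); video_feats: (batch, frames, video_dim)
        a = F.normalize(self.audio_enc(audio_feats), dim=-1)
        v = F.normalize(self.video_enc(video_feats), dim=-1)
        # 1 - cosine similarity per frame, averaged over frames and batch.
        return (1.0 - (a * v).sum(dim=-1)).mean()


def total_loss(translation_loss, audio_feats, video_feats, sync_loss_fn, lambda_sync=0.1):
    """Joint objective: translation loss plus weighted lip-synchrony term."""
    return translation_loss + lambda_sync * sync_loss_fn(audio_feats, video_feats)


if __name__ == "__main__":
    sync_fn = LipSyncLoss()
    audio = torch.randn(4, 25, 80)   # e.g., 25 mel-spectrogram frames per clip
    video = torch.randn(4, 25, 512)  # e.g., 25 lip-crop feature frames per clip
    trans_loss = torch.tensor(2.3)   # placeholder cross-entropy from the S2ST model
    print(total_loss(trans_loss, audio, video, sync_fn))
```

In such a setup, `lambda_sync` controls the trade-off between translation quality and lip-synchrony; the abstract's result (lower LSE-D with no loss in translation quality) suggests this trade-off can be balanced, though the exact weighting and loss formulation used in the paper are not specified here.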