Visual-Aware Text-to-Speech

Zhou, Mohan; Bai, Yalong; Zhang, Wei; Yao, Ting; Zhao, Tiejun; Mei, Tao

Jun-21-2023–arXiv.org Artificial Intelligence

Dynamically synthesizing speech that actively responds to a listening head is critical in face-to-face interaction. For example, a speaker can use the listener's facial expressions to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual input and the sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. For this newly minted task, we devise a baseline model that fuses phoneme-level linguistic information with listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify that our proposal generates more natural audio with scenario-appropriate rhythm and prosody.
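To make the idea of fusing phoneme information with listener visual signals concrete, here is a minimal sketch. It is not the paper's baseline model: the abstract does not specify the fusion mechanism, so this example assumes the simplest possible scheme, in which per-frame listener features are resampled to the phoneme sequence length and concatenated with the phoneme vectors. The names `align_to_length` and `fuse`, and the toy feature values, are hypothetical.

```python
from typing import List

Vector = List[float]


def align_to_length(frames: List[Vector], target_len: int) -> List[Vector]:
    """Nearest-neighbour resample a frame sequence to `target_len` steps.

    Assumption: listener video frames and phonemes cover the same time span,
    so a proportional index mapping is a reasonable alignment.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def fuse(phoneme_feats: List[Vector], visual_feats: List[Vector]) -> List[Vector]:
    """Concatenate each phoneme vector with its time-aligned visual vector.

    Assumption: concatenation stands in for whatever learned fusion a real
    VA-TTS model would use before its acoustic decoder.
    """
    aligned = align_to_length(visual_feats, len(phoneme_feats))
    return [p + v for p, v in zip(phoneme_feats, aligned)]


if __name__ == "__main__":
    # Toy inputs: 4 phonemes with 1-dim features, 2 listener frames with
    # 2-dim features (e.g. hypothetical "nod" and "smile" intensities).
    phonemes = [[1.0], [2.0], [3.0], [4.0]]
    listener = [[0.1, 0.2], [0.9, 0.8]]
    for step in fuse(phonemes, listener):
        print(step)
```

In a trained system the concatenated vectors would feed a learned acoustic model that predicts prosody and spectrogram frames; this sketch only illustrates how the two modalities can be brought onto a common time axis.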

artificial intelligence, machine learning, optical character recognition

Country:
- North America > Canada
  - Quebec > Montreal (0.05)
- Asia > China
  - Heilongjiang Province > Harbin (0.05)
  - Beijing > Beijing (0.04)

Genre:
- Research Report (0.50)

Industry:
- Information Technology > Security & Privacy (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Synthesis (1.00)
  - Machine Learning > Neural Networks (1.00)
  - Vision > Optical Character Recognition (0.83)
