VisualSpeech: Enhance Prosody with Visual Context in TTS

Shumin Que, Anton Ragni

arXiv.org Artificial Intelligence 

Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody from a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as visual features, remains underutilized. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation.

However, no previous studies have explored the impact of visual information on prosody in TTS. Thus, this paper explores the possibility of improving prosody prediction in TTS by means of visual information. It makes three key contributions. First, it demonstrates that visual cues carry valuable prosodic information. Second, it establishes that this visual information complements existing textual features rather than being redundant. Finally, it reveals that integrating …
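To make the idea of conditioning prosody prediction on visual context concrete, the following is a minimal sketch, not the paper's actual architecture: the class name, dimensions, and the cross-attention fusion scheme are all assumptions. It shows one plausible way text encoder states could attend to visual embeddings before predicting per-phoneme prosody targets.

```python
# Hypothetical sketch (assumed design, not VisualSpeech's published model):
# fuse visual context embeddings with text encoder states via cross-attention,
# then predict per-phoneme prosody features such as pitch and energy.
import torch
import torch.nn as nn


class VisualProsodyPredictor(nn.Module):
    """Assumed fusion module: text states attend to visual features."""

    def __init__(self, text_dim=256, visual_dim=512, hidden_dim=256, n_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # map visual features to model dim
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.out = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # e.g. pitch and energy per phoneme (assumed targets)
        )

    def forward(self, text_states, visual_feats):
        # text_states: (B, T_text, text_dim), e.g. phoneme encoder outputs
        # visual_feats: (B, T_vis, visual_dim), e.g. image/video frame embeddings
        q = self.text_proj(text_states)
        kv = self.visual_proj(visual_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # text queries attend to visual keys/values
        return self.out(fused + q)             # residual fusion, then prosody heads


if __name__ == "__main__":
    predictor = VisualProsodyPredictor()
    text = torch.randn(2, 50, 256)    # 2 utterances, 50 phonemes each
    visual = torch.randn(2, 8, 512)   # 8 visual context frames per utterance
    print(predictor(text, visual).shape)  # torch.Size([2, 50, 2])
```

Cross-attention is just one option for the fusion step; simple concatenation of a pooled visual embedding with each text state would serve the same illustrative purpose.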
