VisualSpeech: Enhance Prosody with Visual Context in TTS
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody from a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as visual features, remains underutilized. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation.

However, no previous studies have explored the impact of visual information on prosody in TTS. Thus, this paper explores the possibility of improving prosody prediction in TTS by means of visual information. It makes three key contributions. First, it demonstrates that visual cues carry valuable prosodic information. Second, it establishes that this visual information complements existing textual features rather than being redundant. Finally, it reveals that integrating visual and textual information improves prosody generation.
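The listing does not include the authors' code or architectural details, so as a rough illustration of the idea only, here is a minimal PyTorch sketch of conditioning a prosody predictor on both text and visual embeddings. Everything here is an assumption: the class name `VisualProsodyPredictor`, the dimensions, the GRU encoder, and the fusion-by-concatenation strategy are illustrative choices, not the VisualSpeech model itself.

```python
import torch
import torch.nn as nn

class VisualProsodyPredictor(nn.Module):
    """Toy prosody predictor conditioned on text and visual context.

    Hypothetical sketch: layer choices and dimensions are illustrative
    assumptions, not the architecture described in the paper.
    """

    def __init__(self, text_dim=256, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project the per-utterance visual embedding into the text space.
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        # Encode the fused sequence; a GRU is an arbitrary choice here.
        self.encoder = nn.GRU(text_dim * 2, hidden_dim, batch_first=True)
        # Predict per-token prosody targets: pitch, energy, duration.
        self.prosody_head = nn.Linear(hidden_dim, 3)

    def forward(self, text_emb, visual_emb):
        # text_emb:   (batch, seq_len, text_dim) phoneme/text encodings
        # visual_emb: (batch, visual_dim) one visual-context vector per clip
        v = self.visual_proj(visual_emb)               # (batch, text_dim)
        v = v.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        fused = torch.cat([text_emb, v], dim=-1)       # concatenation fusion
        out, _ = self.encoder(fused)
        return self.prosody_head(out)                  # (batch, seq_len, 3)

# Usage with random stand-in features:
model = VisualProsodyPredictor()
text = torch.randn(2, 50, 256)   # e.g. phoneme encoder outputs
video = torch.randn(2, 512)      # e.g. pooled visual features
prosody = model(text, video)     # -> torch.Size([2, 50, 3])
print(prosody.shape)
```

Broadcasting a single pooled visual vector across the text sequence is the simplest fusion scheme; attention over per-frame visual features would be a natural alternative, but the listing gives no basis for preferring either.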
arXiv.org Artificial Intelligence
Jan-31-2025
- Country:
- Europe > United Kingdom (0.14)
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language (0.87)
- Information Technology > Artificial Intelligence > Vision (1.00)