VisualSpeech: Enhance Prosody with Visual Context in TTS
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody from a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as visual features, remains underutilized. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates both visual and textual information for improved prosody generation.

However, no previous studies have explored the impact of visual information on prosody in TTS. Thus, this paper explores the possibility of improving prosody prediction in TTS by means of visual information. It makes three key contributions. First, it demonstrates that visual cues carry valuable prosodic information. Second, it establishes that this visual information complements existing textual features rather than being redundant. Finally, it reveals that integrating visual and textual information improves prosody generation.
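The listing does not include the authors' code or architectural details, so as a rough illustration of the idea only, here is a minimal PyTorch sketch of conditioning a prosody predictor on both text and visual embeddings. Everything here is an assumption: the class name `VisualProsodyPredictor`, the dimensions, the GRU encoder, and the fusion-by-concatenation strategy are illustrative choices, not the VisualSpeech model itself.

```python
import torch
import torch.nn as nn

class VisualProsodyPredictor(nn.Module):
    """Toy prosody predictor conditioned on text and visual context.

    Hypothetical sketch: layer choices and dimensions are illustrative
    assumptions, not the architecture described in the paper.
    """

    def __init__(self, text_dim=256, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project the per-utterance visual embedding into the text space.
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        # Encode the fused sequence; a GRU is an arbitrary choice here.
        self.encoder = nn.GRU(text_dim * 2, hidden_dim, batch_first=True)
        # Predict per-token prosody targets: pitch, energy, duration.
        self.prosody_head = nn.Linear(hidden_dim, 3)

    def forward(self, text_emb, visual_emb):
        # text_emb:   (batch, seq_len, text_dim) phoneme/text encodings
        # visual_emb: (batch, visual_dim) one visual-context vector per clip
        v = self.visual_proj(visual_emb)               # (batch, text_dim)
        v = v.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        fused = torch.cat([text_emb, v], dim=-1)       # concatenation fusion
        out, _ = self.encoder(fused)
        return self.prosody_head(out)                  # (batch, seq_len, 3)

# Usage with random stand-in features:
model = VisualProsodyPredictor()
text = torch.randn(2, 50, 256)   # e.g. phoneme encoder outputs
video = torch.randn(2, 512)      # e.g. pooled visual features
prosody = model(text, video)     # -> torch.Size([2, 50, 3])
print(prosody.shape)
```

Broadcasting a single pooled visual vector across the text sequence is the simplest fusion scheme; attention over per-frame visual features would be a natural alternative, but the listing gives no basis for preferring either.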
arXiv.org Artificial Intelligence
Jan-31-2025
- Country:
- Europe > United Kingdom (0.14)
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language (0.87)
- Information Technology > Artificial Intelligence > Vision (1.00)