MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis
Li, Xiang, Cheng, Zhi-Qi, He, Jun-Yan, Peng, Xiaojiang, Hauptmann, Alexander G.
arXiv.org Artificial Intelligence
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
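For readers unfamiliar with the objective metrics quoted above, the following is a minimal sketch of how WER and CER are conventionally defined: the edit distance between a reference transcript and a hypothesis transcript, divided by the reference length in words or characters. The abstract does not say which tooling or transcription pipeline the authors used, so the function names here (`edit_distance`, `word_error_rate`, `char_error_rate`) and the choice to count spaces as characters are illustrative assumptions, not the paper's implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (any indexable items)."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length.

    Assumption: spaces count as characters; some toolkits strip them first.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, purely for illustration.
    ref = "the cat sat on the mat"
    hyp = "the cat sat on a mat"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
    print(f"CER = {char_error_rate(ref, hyp):.3f}")
```

Under these definitions, lower values mean the hypothesis transcript is closer to the reference text. Both rates can exceed 1.0 when the hypothesis contains many insertions.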
Apr-28-2024
- Country:
  - Asia > China
    - Guangdong Province (0.14)
  - North America > United States
    - Pennsylvania (0.14)
- Genre:
  - Research Report (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Cognitive Science > Emotion (0.89)
    - Machine Learning
      - Neural Networks > Deep Learning (0.93)
      - Performance Analysis > Accuracy (0.69)
    - Speech > Speech Synthesis (1.00)
    - Vision (1.00)