MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Li, Xiang, Cheng, Zhi-Qi, He, Jun-Yan, Peng, Xiaojiang, Hauptmann, Alexander G.

Apr-28-2024–arXiv.org Artificial Intelligence

Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions because they rely primarily on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
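
The WER and CER figures quoted above are edit-distance metrics. The sketch below shows how such scores are conventionally computed, assuming the synthesized speech has already been transcribed back to text by a speech recognizer. The abstract does not describe the evaluation pipeline, so that step, the function names, and the choice to ignore spaces in CER are illustrative assumptions and not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word Error Rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate: tokens are characters.

    Spaces are dropped here, which is an assumption; some toolkits keep them.
    """
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    # Hypothetical transcript pair, not taken from the paper.
    reference = "i am so happy today"
    hypothesis = "i am very happy"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Both rates are normalized by the reference length, so a hypothesis with many insertions can score above 100%. Corpus-level figures are usually obtained by summing edit distances and reference lengths over all utterances before dividing, not by averaging per-utterance rates.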
