EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Sang-Hoon, Lee, Seong-Whan

Jun-11-2024–arXiv.org Artificial Intelligence

Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model ability to control emotional style and intensity with high-quality expressive speech.

emotion, intensity, speech, (14 more...)

arXiv.org Artificial Intelligence

Jun-11-2024

arXiv.org PDF

Add feedback

Country:
- Europe > Italy
  - Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia
  - South Korea > Seoul
    - Seoul (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Cognitive Science > Emotion (0.70)
  - Vision > Optical Character Recognition (0.62)
  - Speech
    - Speech Synthesis (0.74)
    - Speech Recognition (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found