EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Sang-Hoon, Lee, Seong-Whan
arXiv.org Artificial Intelligence
Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting its multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity while producing high-quality expressive speech.
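The Cartesian-spherical transformation mentioned in the abstract can be illustrated with the standard conversion from Cartesian to spherical coordinates. The following is a minimal sketch, not the authors' implementation: it assumes the arousal, valence, and dominance values are treated as the x, y, and z axes, that the radius can be read as emotion intensity and the two angles as emotion style, and that coordinates are measured from a reference point. The function names and the `origin` parameter are hypothetical.

```python
import math


def cartesian_to_spherical(a, v, d, origin=(0.0, 0.0, 0.0)):
    """Map an (arousal, valence, dominance) point to (r, theta, phi).

    Assumption: `origin` is a hypothetical reference point (for example,
    the centre of neutral speech); the abstract does not specify one.
    """
    x, y, z = a - origin[0], v - origin[1], d - origin[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
    phi = math.atan2(y, x)                         # azimuth in [-pi, pi]
    return r, theta, phi


def spherical_to_cartesian(r, theta, phi, origin=(0.0, 0.0, 0.0)):
    """Inverse mapping, back to (arousal, valence, dominance)."""
    a = r * math.sin(theta) * math.cos(phi) + origin[0]
    v = r * math.sin(theta) * math.sin(phi) + origin[1]
    d = r * math.cos(theta) + origin[2]
    return a, v, d
```

Under these assumptions, scaling `r` while holding `theta` and `phi` fixed would move a point along a ray from the reference point, which is one way to vary intensity without changing the direction that encodes style.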
Jun-11-2024
- Country:
- Europe > Italy
- Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia
- South Korea > Seoul
- Seoul (0.04)
- Japan > Honshū
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Cognitive Science > Emotion (0.70)
- Vision > Optical Character Recognition (0.62)
- Speech
- Speech Synthesis (0.74)
- Speech Recognition (0.47)