Voice Impression Control in Zero-Shot TTS
Fujita, Kenichi; Horiguchi, Shota; Ijima, Yusuke
arXiv.org Artificial Intelligence
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
Jun-11-2025
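The abstract describes the control signal only as a low-dimensional vector holding one intensity per voice impression pair (e.g., dark-bright), which a large language model can produce from a natural language description. The sketch below illustrates that idea as a data structure; it is not the authors' implementation. Everything beyond "dark-bright" is a hypothetical assumption: the other pair names, the [-1, 1] intensity range, the sign convention, the JSON reply format, and the helper names `ImpressionVector` and `parse_llm_response`.

```python
import json
from dataclasses import dataclass

# Hypothetical impression pairs. Only "dark-bright" is named in the abstract;
# the others are illustrative placeholders.
IMPRESSION_PAIRS = ("dark-bright", "calm-energetic", "cold-warm")

# Assumed intensity range and sign convention (not specified in the abstract):
# negative leans toward the first word of a pair, positive toward the second.
MIN_INTENSITY, MAX_INTENSITY = -1.0, 1.0


@dataclass(frozen=True)
class ImpressionVector:
    """Low-dimensional vector: one intensity per impression pair."""
    values: tuple[float, ...]

    @classmethod
    def from_mapping(cls, intensities: dict[str, float]) -> "ImpressionVector":
        # Missing pairs default to 0.0 (neutral); out-of-range values are clipped.
        vals = []
        for pair in IMPRESSION_PAIRS:
            v = float(intensities.get(pair, 0.0))
            vals.append(min(max(v, MIN_INTENSITY), MAX_INTENSITY))
        return cls(tuple(vals))


def parse_llm_response(text: str) -> ImpressionVector:
    """Parse a hypothetical LLM reply such as '{"dark-bright": 0.7}'."""
    data = json.loads(text)
    unknown = set(data) - set(IMPRESSION_PAIRS)
    if unknown:
        raise ValueError(f"unknown impression pairs: {sorted(unknown)}")
    return ImpressionVector.from_mapping(data)


if __name__ == "__main__":
    # e.g. an LLM asked for "a bright, very energetic voice" might reply:
    reply = '{"dark-bright": 0.8, "calm-energetic": 1.5}'
    print(parse_llm_response(reply).values)  # (0.8, 1.0, 0.0)
```

Clipping and defaulting missing pairs to neutral keep the vector well-defined even when the language model's reply is incomplete or overshoots the assumed range; how the vector actually conditions the zero-shot TTS model is not described in the abstract and is not modeled here.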
- Country:
- Asia > Japan > Honshū
- Chūbu
- Ishikawa Prefecture > Kanazawa (0.04)
- Nagano Prefecture > Nagano (0.04)
- Kantō > Kanagawa Prefecture (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Information Technology > Services (0.41)