Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis
Seongyeon Park, Bohyung Kim, Tae-hyun Oh
arXiv.org Artificial Intelligence
Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while retaining useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome procedure of tuning loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance.

[Figure 1: Word Error Rate of speech synthesized through VC or TTS, from VITS and YourTTS [20], according to the loss-balance hyper-parameter α (the loss weight parameter of the reconstruction loss). Both axes are in log scale.]
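To make the role of α concrete, here is a minimal sketch of a loss-balance weight applied to a reconstruction term. The flat grouping of all remaining VITS terms (KL, duration, adversarial, feature-matching) into a single `other_losses` value is an illustrative simplification, not the authors' actual objective; the default of 45.0 mirrors the mel-loss coefficient used in the public VITS implementation, but any specific value here is an assumption.

```python
def combined_loss(recon_loss: float, other_losses: float,
                  alpha: float = 45.0) -> float:
    """Total training loss with a loss-balance weight.

    alpha is the hyper-parameter discussed in the abstract: it scales
    the reconstruction loss against the sum of the remaining terms.
    Tuning alpha by grid search is the burdensome step the proposed
    framework aims to eliminate.
    """
    return alpha * recon_loss + other_losses
```

In practice, each candidate α would require a full training run to evaluate, which is why a search-free way to reach the optimal balance is valuable.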
May-26-2023