Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Seongyeon Park, Bohyung Kim, Tae-hyun Oh

arXiv.org Artificial Intelligence 

Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while having useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome procedure of tuning loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance.

[Figure 1: Word Error Rate of speech synthesized through VC or TTS, from VITS and YourTTS [20], according to the loss-balance hyper-parameter α (the loss weight parameter of the reconstruction loss). Both axes are in log scale.]
