Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Seongyeon Park, Bohyung Kim, Tae-hyun Oh

arXiv.org Artificial Intelligence 

Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while having useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome procedure of tuning loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance.

[Figure 1: Word Error Rate of speech synthesized through VC or TTS, from VITS and YourTTS [20], according to the loss-balance hyper-parameter α (the loss weight parameter of the reconstruction loss). Both axes are in log scale.]
