Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

Marek Strong, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, Trevor Wood

arXiv.org Artificial Intelligence 

We present the Split Vector Quantized Variational Autoencoder (SVQ-VAE), an architecture for neural text-to-speech (NTTS) that enhances the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of an utterance-level bottleneck while maintaining significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings from the expressive task-oriented dialogue domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between standard constant-vector synthesis and vocoded recordings by 32%.
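One common reading of a split vector quantizer is that the utterance-level latent is divided into sub-vectors, each quantized against its own small codebook, so the product of the per-split codebooks yields a large combined discrete space while each individual codebook stays small enough to predict from text. The following is a minimal PyTorch sketch of that idea under those assumptions; the class name, dimensions, and codebook sizes are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class SplitVectorQuantizer(nn.Module):
    """Illustrative split vector quantizer: the latent is split into
    equal sub-vectors, each quantized by its own codebook. The combined
    discrete space has codebook_size ** num_splits entries, while each
    per-split prediction task stays small."""

    def __init__(self, latent_dim=64, num_splits=4, codebook_size=256):
        super().__init__()
        assert latent_dim % num_splits == 0
        self.num_splits = num_splits
        # One independent codebook per split (sizes are assumptions).
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, latent_dim // num_splits)
            for _ in range(num_splits)
        )

    def forward(self, z):
        # z: (batch, latent_dim) utterance-level latent from the encoder.
        quantized, indices = [], []
        for chunk, codebook in zip(z.chunk(self.num_splits, dim=-1),
                                   self.codebooks):
            # Nearest-neighbour lookup in this split's codebook.
            dists = torch.cdist(chunk, codebook.weight)  # (batch, K)
            idx = dists.argmin(dim=-1)                   # (batch,)
            q = codebook(idx)                            # (batch, d/splits)
            # Straight-through estimator: copy gradients to the encoder.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        # Return the quantized latent and the discrete code per split.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

With these example settings, four codebooks of 256 entries give a discrete space of 256^4 combinations, yet a text-to-code predictor only ever faces four 256-way choices, which is the efficiency argument the abstract makes for prediction from text.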
