Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

Aggarwal, Vatsal, Cotescu, Marius, Prateek, Nishant, Lorenzo-Trueba, Jaime, Barra-Chicote, Roberto

Nov-28-2019–arXiv.org Machine Learning

We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).

architecture, emotional strength, naturalness, (12 more...)

arXiv.org Machine Learning

Nov-28-2019

arXiv.org PDF

Add feedback

Country:
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Synthesis (1.00)
  - Vision > Optical Character Recognition (0.92)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found