Speech Synthesis


FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.
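The length regulator described in this abstract can be illustrated with a minimal sketch. It assumes each phoneme has a hidden state and a predicted duration in mel frames. The function name `length_regulate` and the speed factor `alpha` are illustrative assumptions, not code from the paper's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(
    phoneme_states: Sequence[T],
    durations: Sequence[float],
    alpha: float = 1.0,
) -> List[T]:
    """Repeat each phoneme state so the output length matches the mel frames.

    `alpha` is an assumed speed-control factor: values above 1.0 lengthen
    every phoneme (slower speech), values below 1.0 shorten them.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded: List[T] = []
    for state, duration in zip(phoneme_states, durations):
        # Round half up to a whole number of frames, never below zero.
        repeats = max(0, int(duration * alpha + 0.5))
        expanded.extend([state] * repeats)
    return expanded


frames = length_regulate(["h", "e", "y"], [2, 1, 3])
slow_frames = length_regulate(["h", "e", "y"], [2, 1, 3], alpha=2.0)
```

Because the expanded sequence already has the target length, a feed-forward decoder can map all frames to mel-spectrogram frames at once instead of one step at a time.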



Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Neural Information Processing Systems

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline to Deep Voice 1 but is constructed with higher-performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets.
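The idea of conditioning a single model on a trainable speaker embedding can be sketched as follows. This is a simplified illustration under stated assumptions, not the Deep Voice 2 architecture: the embedding is simply added to every hidden frame, and the names `make_speaker_table` and `condition_on_speaker` are hypothetical. In a real system the table entries would be updated by gradient descent along with the other model weights.

```python
import random
from typing import List

Vector = List[float]


def make_speaker_table(num_speakers: int, dim: int, seed: int = 0) -> List[Vector]:
    """Create one small random vector per speaker (stand-in for trained values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_speakers)]


def condition_on_speaker(
    hidden_states: List[Vector],
    speaker_table: List[Vector],
    speaker_id: int,
) -> List[Vector]:
    """Add the chosen speaker's embedding to every hidden frame."""
    embedding = speaker_table[speaker_id]
    for frame in hidden_states:
        if len(frame) != len(embedding):
            raise ValueError("hidden size must equal embedding size in this sketch")
    return [[h + e for h, e in zip(frame, embedding)] for frame in hidden_states]


table = make_speaker_table(num_speakers=3, dim=4)
hidden = [[0.0] * 4, [1.0] * 4]
voice_a = condition_on_speaker(hidden, table, speaker_id=0)
voice_b = condition_on_speaker(hidden, table, speaker_id=1)
```

The same text-derived hidden states produce different conditioned features for each speaker index, which is what lets one set of shared weights generate multiple voices.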


Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

arXiv.org Machine Learning

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Evaluation of reconstruction performance illustrates that the new structure does not degrade the model while allowing better interpretability. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.


BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

arXiv.org Machine Learning

We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.


Audio Sample from SqueezeWave

#artificialintelligence

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge.


tianrengao/SqueezeWave

#artificialintelligence


DeepMind and Google recreate former NFL linebacker Tim Shaw's voice using AI

#artificialintelligence

In August, Google AI researchers working with the ALS Therapy Development Institute shared details about Project Euphonia, a speech-to-text transcription service for people with speaking impairments. They showed that, using data sets of audio from both native and non-native English speakers with neurodegenerative diseases and techniques from Parrotron, an AI tool for people with speech impediments, they could drastically improve the quality of speech synthesis and generation. Recently, in something of a case study, Google researchers and a team from Alphabet's DeepMind employed Euphonia in an effort to recreate the original voice of Tim Shaw, a former NFL linebacker who played for the Carolina Panthers, Jacksonville Jaguars, Chicago Bears, and Tennessee Titans before retiring in 2013. Roughly six years ago, Shaw was diagnosed with ALS, which requires him to use a wheelchair and has left him unable to speak, swallow, or breathe without assistance. Over the course of six months, the joint research team adapted a generative AI model, WaveNet, to the task of synthesizing speech from samples of Shaw's voice recorded prior to his ALS diagnosis.


Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

arXiv.org Machine Learning

We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
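A single step of a Householder flow can be sketched as a reflection of the latent vector z about the hyperplane orthogonal to a vector v, that is, z' = z - 2 v (v·z) / (v·v). The snippet below is a minimal illustration of that reflection only. The function name `householder_step` is hypothetical, and in a full model v would typically be produced by the encoder network, which is not shown here.

```python
import math
from typing import List


def householder_step(z: List[float], v: List[float]) -> List[float]:
    """Reflect z about the hyperplane orthogonal to v."""
    if len(z) != len(v):
        raise ValueError("z and v must have the same dimension")
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        raise ValueError("v must be non-zero")
    # scale = 2 * (v . z) / ||v||^2
    scale = 2.0 * sum(a * b for a, b in zip(v, z)) / norm_sq
    return [zi - scale * vi for zi, vi in zip(z, v)]


def norm(x: List[float]) -> float:
    return math.sqrt(sum(xi * xi for xi in x))


z0 = [3.0, 4.0]
z1 = householder_step(z0, [1.0, 0.0])  # flips the first coordinate
z2 = householder_step(z1, [1.0, 0.0])  # a reflection undoes itself
```

Each reflection is an orthogonal transform, so it preserves the vector's norm and has a Jacobian determinant of magnitude one. Stacking several such steps with different v changes the orientation of the latent distribution without adding a volume-change term to the objective.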


6 Ways Speech Synthesis Is Being Powered By Deep Learning

#artificialintelligence

This model was open-sourced in June 2019 as an implementation of the paper "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". This service is being offered by Resemble.ai. With this product, one can clone any voice and create dynamic, iterable, and unique voice content. Users input a short voice sample and the model -- trained only during playback time -- can immediately deliver text-to-speech utterances in the style of the sampled voice. Bengaluru's Deepsync offers an Augmented Intelligence that learns the way you speak.