Speech Synthesis


Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
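
The abstract describes a three-stage pipeline: a speaker encoder turns a few seconds of reference audio into a fixed-dimensional embedding, a Tacotron 2-style synthesizer maps text to a mel spectrogram conditioned on that embedding, and a WaveNet vocoder renders the waveform. The sketch below shows how the three independently trained components fit together at inference time; the function names, dimensions, and internals are illustrative placeholders, not the paper's released code.

```python
import numpy as np

# Hypothetical stand-ins for the paper's three independently trained
# components. Shapes follow the abstract's description; internals are dummies.

def speaker_encoder(reference_wave):
    """Map a few seconds of reference speech to a fixed-dimensional speaker
    embedding (a 256-dim placeholder here; the real encoder is trained on a
    speaker verification task)."""
    rng = np.random.default_rng(abs(hash(reference_wave.tobytes())) % 2**32)
    embedding = rng.standard_normal(256)
    return embedding / np.linalg.norm(embedding)  # L2-normalized

def synthesizer(phonemes, speaker_embedding):
    """Tacotron 2-style seq2seq network: text -> mel spectrogram, conditioned
    on the speaker embedding. Here the output is a zero placeholder."""
    n_frames = 5 * len(phonemes)       # a real model predicts this length
    return np.zeros((n_frames, 80))    # 80-bin mel spectrogram

def vocoder(mel, hop_length=256):
    """WaveNet-style vocoder: mel spectrogram -> time-domain samples."""
    return np.zeros(mel.shape[0] * hop_length)

# Inference: clone an unseen speaker from a short reference clip.
reference_clip = np.random.randn(16000 * 5)   # roughly 5 s of 16 kHz audio
embedding = speaker_encoder(reference_clip)
mel = synthesizer(list("hello world"), embedding)
waveform = vocoder(mel)
print(embedding.shape, mel.shape, waveform.shape)
```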


FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.
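
The core of FastSpeech's parallel generation is the length regulator: each phoneme's hidden state is repeated according to a predicted duration (distilled from the teacher's attention alignments), producing a frame-level sequence the decoder can process in one pass. Below is a minimal sketch of that operation; the function name and toy durations are illustrative, not the authors' implementation.

```python
import numpy as np

def length_regulator(phoneme_hidden, durations, alpha=1.0):
    """Expand phoneme-level hidden states to frame level by repeating each
    state `duration` times. `alpha` rescales the predicted durations, which
    is how voice speed is controlled (alpha < 1 speeds speech up)."""
    scaled = np.maximum(1, np.round(durations * alpha)).astype(int)
    return np.repeat(phoneme_hidden, scaled, axis=0)

# Toy example: 4 phonemes, hidden size 8, durations taken from a teacher model.
hidden = np.random.randn(4, 8)
durations = np.array([2, 5, 3, 4])                 # frames per phoneme
frames = length_regulator(hidden, durations)       # shape (14, 8)
faster = length_regulator(hidden, durations, 0.5)  # roughly half the frames
print(frames.shape, faster.shape)
```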


6 Ways Speech Synthesis Is Being Powered By Deep Learning

#artificialintelligence

This model was open sourced back in June 2019 as an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. This service is being offered by Resemble.ai. With this product, one can clone any voice and create dynamic, iterable, and unique voice content. Users input a short voice sample and the model can immediately deliver text-to-speech utterances in the style of the sampled voice, without needing to be retrained for each new speaker. Bengaluru's Deepsync offers an Augmented Intelligence that learns the way you speak.


Audio samples from "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis"

#artificialintelligence

Abstract: identical to the NeurIPS paper abstract above; this page presents audio samples accompanying the paper.


Disabled lawmaker first in Japan to use speech synthesizer during Diet session

The Japan Times

A lawmaker with severe physical disabilities attended his first parliamentary interpellation Thursday since being elected in July and became the first lawmaker in Japan ever to use an electronically-generated voice during a Diet session. In the session of the education, culture and science committee, Yasuhiko Funago, who has amyotrophic lateral sclerosis, a condition also known as Lou Gehrig's disease, greeted the committee using a speech synthesizer. He also asked questions through a proxy speaker. "As a newcomer, I am still inexperienced, but with everyone's assistance, I will do my best to tackle (issues)," he said at the beginning of the session. An aide then posed questions on his behalf and expressed his desire to see improvements in the learning environment for disabled children.


DeepMind Uses GANs to Convert Text to Speech

#artificialintelligence

Generative Adversarial Networks (GANs) have revolutionized high-fidelity image generation, making global headlines with their hyperrealistic portraits and content-swapping, while also raising concerns with convincing deepfake videos. Now, DeepMind researchers are expanding GANs to audio, with a new adversarial network approach for high fidelity speech synthesis. Text-to-Speech (TTS) is a process for converting text into a humanlike voice output. One of the most commonly used TTS network architectures is WaveNet, a neural autoregressive model for generating raw audio waveforms. But because WaveNet relies on the sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers.
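
The bottleneck the article points to is WaveNet's sample-by-sample generation: each output sample depends on the previous ones, so every second of audio requires tens of thousands of sequential forward passes, whereas a feed-forward (e.g., GAN-based) generator can emit the whole waveform in one parallel pass. The toy sketch below contrasts the two control flows; the lambda "models" are dummies and not DeepMind's GAN-TTS architecture.

```python
import numpy as np

def autoregressive_generate(model, n_samples):
    """WaveNet-style generation: each sample is conditioned on all previous
    samples, so the time loop cannot be parallelized."""
    audio = np.zeros(n_samples)
    for t in range(1, n_samples):
        audio[t] = model(audio[:t])   # one forward pass per output sample
    return audio

def feedforward_generate(model, noise):
    """GAN-style generation: one forward pass maps latent noise (plus
    linguistic conditioning, omitted here) to the whole waveform at once."""
    return model(noise)

# Dummy "models" just to make the two control flows runnable.
ar_model = lambda history: 0.9 * history[-1]
ff_model = lambda z: np.tanh(z)
print(autoregressive_generate(ar_model, 8))
print(feedforward_generate(ff_model, np.random.randn(8)))
```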


Who Uses Text to Speech (TTS) Anyway? - ReadSpeaker

#artificialintelligence

First things first: what is TTS? TTS, or text-to-speech, technology converts written text into spoken audio. If you know Siri or those handy voice GPS directions on smartphones, then congratulations: you already know TTS. Since 1000 AD, humans have strived to create synthetic speech, but it didn't enter the mainstream until the mid-1970s to early 1980s, when computer operating systems began implementing it. Walt Tetschner, leader of the group that produced DECtalk in 1983, explains that while the voice wasn't perfect, it was still natural sounding and was used by companies such as MCI and Mtel (two-way paging).


Learn about the benefits of text to speech

#artificialintelligence

Every end user is a customer, and the quality of the customer journey is everything, regardless of whether the objective is purchasing a product or service or simply consuming content. End users can be website visitors; users of applications, devices, services, and machines; online learners or teachers; and more. Text to speech allows content owners to respond to the different needs and desires of each user in terms of how they interact with the content.


Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment

arXiv.org Machine Learning

Yusuke Yasuda, Xin Wang, Junichi Yamagishi (National Institute of Informatics and SOKENDAI, Japan). Abstract: Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based methods. Recently, hard-attention-based methods have been proposed to prevent fatal alignment errors, but their sampling method of discrete alignment is poorly investigated. This research investigates various combinations of sampling methods and probability distributions for alignment transition modeling in a hard-alignment-based sequence-to-sequence TTS method called SSNT-TTS. We clarify the common sampling methods of discrete variables, including greedy search, beam search, and random sampling from a Bernoulli distribution, in a more general way. Furthermore, we introduce the binary Concrete distribution to model discrete variables more properly. The results of a listening test show that deterministic search is preferable to stochastic search, and that the binary Concrete distribution is robust with stochastic search for natural alignment transition.
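
The comparison in the paper boils down to how the binary "advance the alignment or stay" decision is drawn at each decoder step. Below is a rough sketch of three of the strategies discussed (greedy search, Bernoulli sampling, and the binary Concrete relaxation); beam search is omitted, and the function names and interface are illustrative rather than the authors' SSNT-TTS code.

```python
import numpy as np

def greedy_transition(p_move):
    """Deterministic search: take the more probable alignment action."""
    return int(p_move >= 0.5)

def bernoulli_transition(p_move, rng):
    """Stochastic search: draw the move/stay decision from a Bernoulli."""
    return int(rng.random() < p_move)

def binary_concrete_transition(p_move, temperature, rng):
    """Binary Concrete (Gumbel-sigmoid) relaxation of the Bernoulli: a
    continuous sample in (0, 1) that concentrates on {0, 1} as the
    temperature approaches 0."""
    logit = np.log(p_move) - np.log1p(-p_move)
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logit + logistic_noise) / temperature))

rng = np.random.default_rng(0)
p = 0.7  # model's probability of advancing the alignment at this step
print(greedy_transition(p),
      bernoulli_transition(p, rng),
      binary_concrete_transition(p, temperature=0.3, rng=rng))
```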

