Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.


Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.


Are Microsoft And VocalZoom The Peanut Butter And Chocolate Of Voice Recognition?

#artificialintelligence

Moore's law has driven silicon chip circuitry to the point where we are surrounded by devices equipped with microprocessors. The devices are frequently wonderful; communicating with them – not so much. Pressing buttons on smart devices or keyboards is often clumsy and never the method of choice when effective voice communication is possible. The keyword in the previous sentence is "effective". Technology has advanced to the point where we are in the early stages of being able to communicate with our devices using voice recognition.


Speaker Verification Using Speaker-Specific Prompts

AAAI Conferences

Intra-and inter-speaker information, which include acoustical, speaker style, speech rate and temporal variation, despite their critical importance for the verification of claims, still have not been captured effectively. As a result of such modeling deficiency, existing speaker verification systems generally test claimed utterances with interfacing procedures that are common to all speakers. In this paper, a novel method is introduced in which speaker-specific attributes are expressed with reliable, first and second order intra-speaker and inter-speaker statistical information on the output space of speaker models in an explicit way. This is achieved through the computation of the Speech Unit Confusion Matrix (SUCM) that is employed in the scoring phase. An online updating procedure of SUCM is also presented. Experimental results with spoken alphabetic characters used as the basic speech unit indicate that the new method can improve system performance significantly. The method can also be directly extended to the use of other speech units (phonemes, sub-words, digits).


Generative x-vectors for text-independent speaker verification

arXiv.org Machine Learning

Speaker verification (SV) systems using deep neural network embeddings, so-called the x-vector systems, are becoming popular due to its good performance superior to the i-vector systems. The fusion of these systems provides improved performance benefiting both from the discriminatively trained x-vectors and generative i-vectors capturing distinct speaker characteristics. In this paper, we propose a novel method to include the complementary information of i-vector and x-vector, that is called generative x-vector. The generative x-vector utilizes a transformation model learned from the i-vector and x-vector representations of the background data. Canonical correlation analysis is applied to derive this transformation model, which is later used to transform the standard x-vectors of the enrollment and test segments to the corresponding generative x-vectors. The SV experiments performed on the NIST SRE 2010 dataset demonstrate that the system using generative x-vectors provides considerably better performance than the baseline i-vector and x-vector systems. Furthermore, the generative x-vectors outperform the fusion of i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.