Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
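
As a rough illustration of the pipeline described in this abstract, the sketch below wires a speaker encoder, a text-to-spectrogram synthesizer, and a vocoder together around a fixed-dimensional speaker embedding. The module names, layer sizes, and the simple concatenation-based conditioning are illustrative assumptions only; the paper's actual synthesizer is Tacotron 2 and its vocoder is WaveNet, neither of which is reproduced here.

```python
# Hypothetical sketch of the three-component pipeline; shapes and names are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a few seconds of reference speech (mel frames) to a fixed-dim embedding."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, ref_mels):                       # [B, T_ref, n_mels]
        out, _ = self.lstm(ref_mels)
        emb = self.proj(out[:, -1])                     # output at the last frame
        return F.normalize(emb, dim=-1)                 # unit-norm speaker embedding

class Synthesizer(nn.Module):
    """Toy stand-in for the Tacotron 2-style text-to-mel network, conditioned
    by concatenating the speaker embedding to each text frame."""
    def __init__(self, n_symbols=80, txt_dim=128, emb_dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, txt_dim)
        self.rnn = nn.GRU(txt_dim + emb_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids, spk_emb):               # [B, T_txt], [B, emb_dim]
        x = self.embed(text_ids)
        cond = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.rnn(torch.cat([x, cond], dim=-1))
        return self.to_mel(out)                          # [B, T_txt, n_mels]

class Vocoder(nn.Module):
    """Placeholder for the WaveNet vocoder: mel frames -> waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Linear(n_mels, hop)

    def forward(self, mels):                             # [B, T, n_mels]
        return self.net(mels).reshape(mels.size(0), -1)  # [B, T * hop]

# Usage: condition synthesis on a few seconds of reference audio.
enc, syn, voc = SpeakerEncoder(), Synthesizer(), Vocoder()
ref = torch.randn(1, 300, 40)                            # ~3 s of reference mel frames
spk = enc(ref)
mel = syn(torch.randint(0, 80, (1, 50)), spk)
wav = voc(mel)
print(spk.shape, mel.shape, wav.shape)
```

The design point the sketch preserves is that the encoder is trained separately on a speaker verification task, so at synthesis time a short reference clip from an unseen speaker is enough to produce the conditioning embedding.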


Are Microsoft And VocalZoom The Peanut Butter And Chocolate Of Voice Recognition?

#artificialintelligence

Moore's law has driven silicon chip circuitry to the point where we are surrounded by devices equipped with microprocessors. The devices are frequently wonderful; communicating with them – not so much. Pressing buttons on smart devices or keyboards is often clumsy and never the method of choice when effective voice communication is possible. The keyword in the previous sentence is "effective". Technology has advanced to the point where we are in the early stages of being able to communicate with our devices using voice recognition.


Are Voice Recognition Based Payments The Next Step in FinTech Convenience? - FindBiometrics

#artificialintelligence

PayPal may be looking into voice recognition to enable more digital commerce use cases in the near future, if a new post-MWC blog post offers any hints. Looking back on last week's event, for which we featured extensive firsthand coverage, PayPal Head of Global Initiatives Anuj Nayar notes two dominant trends. One is the Internet of Things, including new connected car technologies like PayPal's new car commerce feature with Shell and Jaguar (and Apple). The other, as Nayar puts it, is "conversational commerce." Looking at emerging digital commerce opportunities in areas like virtual reality, connected appliances, and even drones, Nayar asserts that it "won't be convenient or realistic to pull out a credit card or punch in your information in any of these scenarios".


Detroit auto show: Tech giants take the wheel in voice recognition systems

USATODAY - Tech Top Stories

Can you fix voice recognition in new cars? After years of designing their own often-faulty voice recognition systems, auto companies are handing the reins over to tech giants that have already developed the technology for their devices. The trend is on full display at the 2019 Detroit auto show, where automakers are showcasing new vehicles with increasingly common systems that allow drivers to plug in their phones and bypass built-in infotainment systems. Using spoken commands to tune the radio, make a call or get directions has required patience, awkward pronunciation and frequent do-overs ever since it became possible in some vehicles earlier this century.