Future of voice recognition: Assistants that learn from you

AITopics Original Links

Voice-activated assistants are playing an increasingly prominent role in the technology world, with Apple's introduction of Siri for the iPhone 4S and Google's (rumored) work on a Siri competitor for Android phones. Voice-activated technology isn't new; it's just getting better, thanks to increasingly powerful processors and cloud services, advances in natural language processing, and improved voice recognition algorithms. We spoke with Nuance Communications, maker of Dragon software and one of the biggest names in voice recognition technology, about why voice is becoming more popular and what advances we can expect in the future. Peter Mahoney, Nuance's chief marketing officer and general manager of the Dragon desktop business, told Ars that one of the most significant improvements coming in the next few years is a far more conversational voice-activated assistant that remembers everything you say, which should produce better responses to casual questions.


Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from only a few seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time-domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
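
Concretely, the abstract describes three modules wired in sequence: reference audio goes into a speaker encoder, the resulting embedding conditions a text-to-spectrogram synthesizer, and a vocoder renders audio. Below is a minimal, untrained PyTorch sketch of that data flow only; the class names, layer sizes, and the simple LSTM/linear stand-ins for Tacotron 2 and WaveNet are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    # Stage 1: a few seconds of reference mel frames -> fixed-dim voice embedding.
    def __init__(self, n_mels=40, hidden=256, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mels):                    # (B, T, n_mels)
        _, (h, _) = self.lstm(mels)
        e = self.proj(h[-1])                    # final hidden state of top layer
        return e / e.norm(dim=1, keepdim=True)  # L2-normalized embedding

class Synthesizer(nn.Module):
    # Stage 2 stand-in for Tacotron 2: the speaker embedding is concatenated
    # with the text-encoder output at each step to condition the decoder.
    # (The real model is attention-based and autoregressive.)
    def __init__(self, vocab=100, embed_dim=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, 256)
        self.encoder = nn.LSTM(256, 256, batch_first=True)
        self.decoder = nn.LSTM(256 + embed_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids, spk):           # (B, T), (B, embed_dim)
        enc, _ = self.encoder(self.text_emb(text_ids))
        spk = spk.unsqueeze(1).expand(-1, enc.size(1), -1)
        dec, _ = self.decoder(torch.cat([enc, spk], dim=-1))
        return self.to_mel(dec)                 # (B, T, n_mels) mel spectrogram

class Vocoder(nn.Module):
    # Stage 3 stand-in for the WaveNet vocoder: mel frames -> waveform samples.
    # (The real vocoder is autoregressive; a linear upsampler shows the shapes.)
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)

    def forward(self, mel):                     # (B, T, n_mels)
        return self.upsample(mel).flatten(1)    # (B, T * hop)

# Untrained demo with random weights, just to trace the data flow.
enc, syn, voc = SpeakerEncoder(), Synthesizer(), Vocoder()
ref_mels = torch.randn(1, 120, 40)              # reference speech features
text_ids = torch.randint(0, 100, (1, 30))       # toy character/phoneme ids
waveform = voc(syn(text_ids, enc(ref_mels)))
print(waveform.shape)                           # torch.Size([1, 7680])
```

The key design point the abstract emphasizes is that the speaker encoder is trained separately, on a verification task without transcripts, so the synthesizer only ever sees a fixed-length embedding and can generalize to voices never encountered during TTS training.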


The End of All Things Considered

Slate

The same principles apply: The audience for those shows is aging and declining, and news of all kinds is available at a pace that was unthinkable even 10 years ago. And as Olbermann pointed out, the desire for the "leisurely, well-done" show is very much on the wane. I executive-produced Weekend All Things Considered for four years, and while I loved the work, the constraints of the magazine format were frustrating. The time had to be filled, no matter the quality of the stories.


Are Microsoft And VocalZoom The Peanut Butter And Chocolate Of Voice Recognition?

#artificialintelligence

Moore's law has driven silicon chip circuitry to the point where we are surrounded by devices equipped with microprocessors. The devices are frequently wonderful; communicating with them, not so much. Pressing buttons on smart devices or keyboards is often clumsy and never the method of choice when effective voice communication is possible. The key word in the previous sentence is "effective": technology has advanced to the point where we are in the early stages of being able to communicate with our devices using voice recognition.