We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
In August, Google AI researchers working with the ALS Therapy Development Institute shared details about Project Euphonia, a speech-to-text transcription service for people with speaking impairments. They showed that, using data sets of audio from both native and non-native English speakers with neurodegenerative diseases and techniques from Parrotron, an AI tool for people with impediments, they could drastically improve the quality of speech synthesis and generation. Recently, in something of a case study, Google researchers and a team from Alphabet's DeepMind employed Euphonia in an effort to recreate the original voice of Tim Shaw, a former NFL football linebacker who played for the Carolina Panthers, Jacksonville Jaguars, Chicago Bears, and Tennessee Titans before retiring in 2013. Roughly six years ago, Shaw was diagnosed with ALS, which requires him to use a wheelchair and left him unable to speak, swallow, or breathe without assistance. Over the course of six months, the joint research team adapted a generative AI model -- WaveNet -- to the task of synthesizing speech from samples of Shaw's voice prior to his ALS diagnoses.
Tracking the health of underwater species is critical to understanding the effects of climate change on marine ecosystems. Unfortunately, it's a time-consuming process -- biologists conduct studies with echosounders that use sonar to determine water and object depth, and they manually interpret the resulting 2D echograms. These interpretations are often prone to error and require pricey software like Echoview. Fortunately, a team of research scientists hailing from the University of Victoria in Canada are developing a machine learning method for detecting specific biological targets in acoustic survey data. In a preprint paper ("A Deep Learning based Framework for the Detection of Schools of Herring in Echograms"), they say that their approach -- which they tested on schools of herring -- might measurably improve the accuracy of environmental monitoring.
Moore's law has driven silicon chip circuitry to the point where we are surrounded by devices equipped with microprocessors. The devices are frequently wonderful; communicating with them – not so much. Pressing buttons on smart devices or keyboards is often clumsy and never the method of choice when effective voice communication is possible. The keyword in the previous sentence is "effective". Technology has advanced to the point where we are in the early stages of being able to communicate with our devices using voice recognition.
Intra-and inter-speaker information, which include acoustical, speaker style, speech rate and temporal variation, despite their critical importance for the verification of claims, still have not been captured effectively. As a result of such modeling deficiency, existing speaker verification systems generally test claimed utterances with interfacing procedures that are common to all speakers. In this paper, a novel method is introduced in which speaker-specific attributes are expressed with reliable, first and second order intra-speaker and inter-speaker statistical information on the output space of speaker models in an explicit way. This is achieved through the computation of the Speech Unit Confusion Matrix (SUCM) that is employed in the scoring phase. An online updating procedure of SUCM is also presented. Experimental results with spoken alphabetic characters used as the basic speech unit indicate that the new method can improve system performance significantly. The method can also be directly extended to the use of other speech units (phonemes, sub-words, digits).