Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
WaveNet used machine learning to build a voice sample by sample, and the results were, as I put it then, "eerily convincing." The general idea behind the tech was to recreate words and sentences not by coding grammatical and tonal rules manually, but allowing a machine learning system to see those patterns in speech and generate them sample by sample. The new, improved WaveNet generates sound at 20x real time -- generating the same two-second clip in a tenth of a second. In keeping with the trend of "big tech companies doing what the other big tech companies are doing," Apple, too, recently revamped its assistant (Siri, don't you know) with a machine learning-powered speech model.
Considered the first electrical speech synthesizer, VODER (Voice Operation DEmonstratoR) was developed by Homer Dudley at Bell Labs and demonstrated at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition. Difficult to use and difficult to operate, VODER nonetheless paved the way for future machine-generated speech.
The work will have a particular focus on the development of structured acoustic models which take account of factors such as accent and speaking style, and on the development of machine learning techniques for vocoding. You will have the necessary programming ability to conduct research in this area, a background in statistical modeling using Hidden Markov Models, DNN, RNN, speech signal processing, and research experience in speech synthesis. A background in one or more of the following areas is also desirable: statistical parametric text-to-speech synthesis using HMMs and HSMMs; glottal source modeling; speech signal modeling; speaker adaptation using the MLLR or MAP family of techniques; familiarity with software tools including DNN, Deep Learning, RNN, HTK, HTS, Festival; and familiarity with modern machine learning. Develop and extend speech synthesis technologies in Oben's proprietary speech synthesis system, in view of the realization of prosody and voice quality modifications; Develop and apply algorithms to annotate prosody and voice quality in expressive speech synthesis corpora Carry out a listener evaluation study of expressive synthetic speech.
I have to warn you that I haven't had much success in generating fine samples, although the source code itself is complete. I've tried to find what's wrong, but now changed my mind to open the current code to everyone because I know many people are working on this project and my work might be a help for them.
Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, and without any previous guidance. "Deep voice 2 can learn from hundreds of voices and imitate them perfectly," a blog post says. In a research paper (PDF), Baidu concludes that its neural network can create voice pretty effectively even from small voice samples from hundreds of different speakers.
TL;DR Baidu's TTS system now supports multi-speaker conditioning, and can learn new speakers with very little data (a la LyreBird). I'm really excited about the recent influx of neural-net TTS systems, but all of the them seem to be too slow for real time dialog, or not publicly available, or both. Hoping that one of them gets a high quality open-source implementation soon!
Next time you hear a voice generated by Baidu's Deep Voice 2, you might not be able to tell whether it's human. That's leaps and bounds better than early versions of Deep Voice, which took multiple hours to learn one voice. Then, it autonomously derives unique voices from that model -- unlike voice assistants like Apple's Siri, which require that a human record thousands of hours of speech that engineers tune by hand, Deep Voice 2 doesn't require guidance or manual intervention. Google's WaveNet, a product of the company's DeepMind division, generates voices by sampling real human speech and independently creating its own sounds in a variety of voices.
While Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, add some smartly-placed background noise to cover up the distortion, and the recordings could pass off as genuine to unsuspecting ears. AI-based personal assistants like Siri and Cortana rely on speech synthesizers to create a more natural interface with users, while audiobook companies may one day utilize the technology to automatically and cheaply generate products. "We want to improve human-computer interfaces and create completely new applications for speech synthesis," explains de Brébisson to Singularity Hub. That's because different voices share a lot of similar information that is already "stored" within the artificial network, explains de Brébisson.
Using a powerful new algorithm, a Montreal-based AI startup has developed a voice generator that can mimic virtually any person's voice, and even add an emotional punch when necessary. "We train our models on a huge dataset with thousands of speakers," Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. Eventually, a refined version of this system could replicate a person's voice with incredible accuracy, making it virtually impossible for a human listener to discern the original from the emulation. It will be a long, long time before a speech synthesis program can replicate every single aspect of a person's distinctive speech, like the finer details of vocal timbre (i.e.
Now Baidu's artificial intelligence lab has revealed its work on speech synthesis. Google DeepMind called its system WaveNet. Baidu's system first has to work out the phenome boundaries in the following way: "(silence HH), (HH, EH), (EH, L), (L, OW), (OW, silence)." "To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units," say the Baidu researchers.