Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
Considered the first electrical speech synthesizer, VODER (Voice Operating DEmonstratoR) was developed by Homer Dudley at Bell Labs and demonstrated at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition. Difficult to operate, VODER nonetheless paved the way for future machine-generated speech.
The work will have a particular focus on the development of structured acoustic models which take account of factors such as accent and speaking style, and on the development of machine learning techniques for vocoding. You will have the necessary programming ability to conduct research in this area, a background in statistical modeling using Hidden Markov Models, DNNs, and RNNs, a background in speech signal processing, and research experience in speech synthesis. A background in one or more of the following areas is also desirable: statistical parametric text-to-speech synthesis using HMMs and HSMMs; glottal source modeling; speech signal modeling; speaker adaptation using the MLLR or MAP family of techniques; familiarity with deep learning (DNNs, RNNs) and with software tools including HTK, HTS, and Festival; and familiarity with modern machine learning.
Develop and extend speech synthesis technologies in Oben's proprietary speech synthesis system, with a view to realizing prosody and voice quality modifications; develop and apply algorithms to annotate prosody and voice quality in expressive speech synthesis corpora; carry out a listener evaluation study of expressive synthetic speech.
I have to warn you that I haven't had much success in generating fine samples, although the source code itself is complete. I tried to find what's wrong, but I have now decided to open the current code to everyone, because I know many people are working on this project and my work might help them.
Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, without any previous guidance. "Deep Voice 2 can learn from hundreds of voices and imitate them perfectly," a blog post says. In a research paper, Baidu concludes that its neural network can create voices pretty effectively even from small voice samples from hundreds of different speakers.
TL;DR Baidu's TTS system now supports multi-speaker conditioning, and can learn new speakers with very little data (a la LyreBird). I'm really excited about the recent influx of neural-net TTS systems, but all of them seem to be too slow for real-time dialog, or not publicly available, or both. Hoping that one of them gets a high-quality open-source implementation soon!
Next time you hear a voice generated by Baidu's Deep Voice 2, you might not be able to tell whether it's human. That's leaps and bounds better than early versions of Deep Voice, which took multiple hours to learn one voice. Then, it autonomously derives unique voices from that model -- unlike voice assistants like Apple's Siri, which require that a human record thousands of hours of speech that engineers tune by hand, Deep Voice 2 doesn't require guidance or manual intervention. Google's WaveNet, a product of the company's DeepMind division, generates voices by sampling real human speech and independently creating its own sounds in a variety of voices.
Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, but add some smartly placed background noise to cover up the distortion, and the recordings could pass as genuine to unsuspecting ears. AI-based personal assistants like Siri and Cortana rely on speech synthesizers to create a more natural interface with users, while audiobook companies may one day use the technology to generate products automatically and cheaply. "We want to improve human-computer interfaces and create completely new applications for speech synthesis," explains de Brébisson to Singularity Hub. That's because different voices share a lot of similar information that is already "stored" within the artificial network, explains de Brébisson.
But there might come a time when a robot could dupe you into thinking that you're speaking with a real person, thanks to a new AI called WaveNet developed by Google's DeepMind team. Currently, developers use one of two methods to create speech programs. In order to build a speech program that actually sounds human, the team fed the neural network raw audio waveforms recorded from real human speakers. As such, WaveNet speaks by forming individual sound waves.
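The idea of building speech one waveform value at a time can be illustrated with a small sketch. This is not WaveNet itself: the neural network is replaced by a hypothetical hand-written predictor (`predict_next`, a damped resonance), and the sample rate, frequency, and noise level are assumptions chosen only for illustration. What it does show is the autoregressive loop, in which every new audio sample is computed from the samples already generated.

```python
import math
import random

SAMPLE_RATE = 16000  # assumed sample rate in Hz, for illustration only


def predict_next(history, freq_hz=220.0, decay=0.999):
    """Hypothetical stand-in for the neural network.

    Predicts the next sample from the two previous ones using a fixed
    damped resonance; a real system would learn this mapping from
    recorded speech.
    """
    omega = 2 * math.pi * freq_hz / SAMPLE_RATE
    a1 = 2 * decay * math.cos(omega)
    a2 = -decay * decay
    return a1 * history[-1] + a2 * history[-2]


def generate(num_samples, seed=0, noise=0.001):
    """Generate a waveform one sample at a time (autoregressively)."""
    rng = random.Random(seed)
    samples = [0.0, 0.05]  # two starting samples to seed the loop
    while len(samples) < num_samples:
        # The next value depends only on what has been generated so far,
        # plus a little randomness.
        nxt = predict_next(samples) + rng.gauss(0.0, noise)
        samples.append(max(-1.0, min(1.0, nxt)))  # keep within [-1, 1]
    return samples


waveform = generate(SAMPLE_RATE // 10)  # 0.1 seconds of audio
print(len(waveform), "samples generated")
```

Because each sample needs the previous ones, the loop cannot be trivially parallelized, which is one reason sample-level generation tends to be slow.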
Google and other companies have made huge advances in making human speech understandable by machines, but making the reply sound realistic has proven more challenging. This is a challenge worthy of machine learning because modeling sounds as a waveform is extremely tricky. DeepMind found that audio generated by WaveNet was considerably more realistic than either concatenative or parametric TTS. Even when input text isn't provided, the neural network can generate outputs -- the babbling of a machine that sounds like a human speaking a language you've never heard before.
Existing text-to-speech (TTS) systems tend to use an approach called concatenative TTS, where the audio is generated by recombining fragments of recorded speech. There's also a technique called parametric TTS that generates speech by passing information through a vocoder, but that sounds even less natural. DeepMind claimed that blind tests with human subjects showed the WaveNet audio to be at least 50% closer to real human speech--though of course such tests are subjective. The post includes clips of the "music" generated by WaveNets that were trained on classical music--again, a good approximation of actual music that might pass for the real thing if you're not listening too closely.
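As a rough illustration of the concatenative approach described above, the sketch below stitches stored fragments together with a short crossfade at each join. Everything in it is an assumption made for the example: the unit names, the `UNITS` inventory, and the sine tones standing in for real recordings. A production system would select fragments from a large database of recorded speech rather than from four fixed tones.

```python
SAMPLE_RATE = 16000  # assumed sample rate in Hz, for illustration only

import math


def fake_recording(freq_hz, duration_ms=50):
    """Stand-in for a recorded speech fragment: a short sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# Hypothetical unit inventory: one stored fragment per sound unit.
UNITS = {
    "HH": fake_recording(300),
    "AH": fake_recording(700),
    "L": fake_recording(400),
    "OW": fake_recording(500),
}


def crossfade_concat(fragments, overlap=80):
    """Join fragments, blending `overlap` samples at each boundary."""
    out = list(fragments[0])
    for frag in fragments[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the new fragment
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * frag[i]
        out.extend(frag[overlap:])
    return out


def synthesize(unit_names):
    """Look up each unit's fragment and concatenate them in order."""
    return crossfade_concat([UNITS[name] for name in unit_names])


audio = synthesize(["HH", "AH", "L", "OW"])
print(len(audio), "samples")
```

The crossfade only smooths the joins; it cannot change the pitch or emphasis of the stored fragments.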