Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
Stephen Hawking's computer-generated voice is so iconic that it's trademarked -- The filmmakers behind The Theory of Everything had to get Hawking's personal permission to use the voice in his biopic. But that voice has an interesting origin story of its own. Back in the '80s, when Hawking was first exploring text-to-speech communication options after he lost the power of speech, a pioneer in computer-generated speech algorithms was working at MIT on that very thing. His name was Dennis Klatt. As Wired uncovered, Klatt's work was incorporated into one of the first devices that translated speech into text: the DECtalk.
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
Google has plunged high towards its'AI-first' dream. The tech giant has attempted to develop a Text-to-speech system that has exactly human-like articulation. This AI system is called "Tacotron 2" that has the ability to give an AI-generated computer speech in a human-voice. Google researchers mentioned in the blog post that the new procedure does not utitilise complex linguistic and acoustic features as input. In place of it, they developed human-like speech from text using neural networks trained using only speech examples and corresponding text transcript.
Get ready for the little person living inside your phone and speaker to sound a lot more life-like. Google believes it has reached a new milestone in the quest to make computer-generated speech indistinguishable from human speech with Tacotron 2, a system that trains neural networks to generate eerily natural-sounding speech from text, and they have the samples to prove it.
Last year, Google showed off WaveNet, a new way of generating speech that didn't rely on a bulky library of word bits or cheap shortcuts that result in stilted speech. WaveNet used machine learning to build a voice sample by sample, and the results were, as I put it then, "eerily convincing." Previously bound to the lab, the tech has now been deployed in the latest version of Google Assistant. The general idea behind the tech was to recreate words and sentences not by coding grammatical and tonal rules manually, but allowing a machine learning system to see those patterns in speech and generate them sample by sample. A sample, in this case, being the tone generated every 1/16,000th of a second.
Considered the first electrical speech synthesizer, VODER (Voice Operation DEmonstratoR) was developed by Homer Dudley at Bell Labs and demonstrated at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition. Difficult to use and difficult to operate, VODER nonetheless paved the way for future machine-generated speech.