Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker 'GlotNet' vocoder, which utilizes a WaveNet to generate glottal excitation waveforms, which are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably with a direct WaveNet vocoder trained with the same model architecture and data.
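The source-filter idea mentioned in the abstract above can be illustrated with a minimal sketch: an excitation signal is passed through an all-pole "vocal tract" filter to produce the output waveform. Everything here is an illustrative assumption rather than the paper's method: the impulse train stands in for the glottal excitation that GlotNet would generate with a WaveNet, and the function names and the single filter coefficient are hypothetical.

```python
def impulse_train(n_samples, period):
    """Crude stand-in for a glottal excitation: one pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, a):
    """Apply an all-pole filter with coefficients a = [1, a1, ..., ap].

    Implements y[n] = x[n] - sum_{k=1..p} a[k] * y[n-k].
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


# Assumed toy values: 160 samples, a pulse every 80 samples,
# and a one-pole filter chosen only for illustration.
excitation = impulse_train(n_samples=160, period=80)
vocal_tract = [1.0, -0.9]
speech = all_pole_filter(excitation, vocal_tract)
```

In a real vocoder the filter coefficients would be estimated per frame from acoustic features, and the excitation would come from the trained model instead of a fixed pulse train.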
Google will now let developers use the text-to-speech synthesis that powers the voices in Google Assistant and Maps. Cloud Text-to-Speech is available now through the Google Cloud Platform, and the company says it can be used to power voice response systems in call centers, enable IoT device speech and convert media like news articles and books into a spoken format. There are 32 different voice options in 12 languages, and users can customize pitch, speaking rate and volume gain. Additionally, a selection of the available voices was built with Google's WaveNet model, which was developed by Google's DeepMind team and first announced by the company in 2016.
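As a rough illustration of the customization options described above, the sketch below builds a JSON request body in the shape used by the Cloud Text-to-Speech REST `text:synthesize` method. It makes no network call. The helper name `build_synthesize_request` is hypothetical, and the value ranges it checks are assumptions that should be verified against the current Google documentation.

```python
import json


def build_synthesize_request(text, language_code="en-US", voice_name=None,
                             pitch=0.0, speaking_rate=1.0, volume_gain_db=0.0):
    """Build a request body dict for a text-to-speech synthesis call.

    The range checks below are assumed limits, not guaranteed ones.
    """
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking_rate outside assumed range [0.25, 4.0]")
    if not -20.0 <= pitch <= 20.0:
        raise ValueError("pitch outside assumed range [-20.0, 20.0] semitones")
    if not -96.0 <= volume_gain_db <= 16.0:
        raise ValueError("volume_gain_db outside assumed range [-96.0, 16.0] dB")

    voice = {"languageCode": language_code}
    if voice_name is not None:
        voice["name"] = voice_name  # optional: a specific voice from the catalogue

    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {
            "audioEncoding": "MP3",
            "pitch": pitch,
            "speakingRate": speaking_rate,
            "volumeGainDb": volume_gain_db,
        },
    }


request_body = build_synthesize_request(
    "Hello from a call center.", pitch=2.0, speaking_rate=0.9, volume_gain_db=-3.0
)
request_json = json.dumps(request_body)
```

The resulting JSON would be sent with an authenticated POST request; authentication and response decoding are left out of this sketch.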
Google Cloud outlined Cloud Text-to-Speech, a machine learning service that uses a model by Google's DeepMind subsidiary to analyze raw audio. With the move, developers will get more access to the natural-sounding text-to-speech technology used in Google Assistant, Search, Maps and others. According to Google, Cloud Text-to-Speech can be used to power call center voice response systems, enable Internet of Things devices to talk and convert text-based media into spoken formats. Google Cloud Text-to-Speech allows customers to choose from 32 different voices in 12 languages.
Stephen Hawking's computer-generated voice is so iconic that it's trademarked: the filmmakers behind The Theory of Everything had to get Hawking's personal permission to use the voice in his biopic. But that voice has an interesting origin story of its own. Back in the '80s, when Hawking was first exploring text-to-speech communication options after he lost the power of speech, a pioneer in computer-generated speech algorithms was working at MIT on that very thing. His name was Dennis Klatt. As Wired uncovered, Klatt's work was incorporated into one of the first devices that translated text into speech: the DECtalk.
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
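To put the inference figure above in perspective, a quick back-of-the-envelope calculation converts ten million queries per day into an average per-second rate. It assumes queries are spread evenly over 24 hours, which the abstract does not state.

```python
# Average throughput implied by ten million queries per day,
# assuming a perfectly uniform load over the whole day.
queries_per_day = 10_000_000
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

queries_per_second = queries_per_day / seconds_per_day
print(f"{queries_per_second:.2f} queries per second on average")
```

Under that assumption, the server would need to sustain roughly 116 queries per second; real traffic with peaks would require more headroom.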
Google has taken a big step towards its 'AI-first' dream. The tech giant has developed a text-to-speech system that aims for human-like articulation. The AI system, called "Tacotron 2", can generate computer speech in a human-sounding voice. Google researchers mentioned in a blog post that the new approach does not utilise complex linguistic and acoustic features as input. Instead, it generates human-like speech from text using neural networks trained only on speech examples and the corresponding text transcripts.
Get ready for the little person living inside your phone and speaker to sound a lot more life-like. Google believes it has reached a new milestone in the quest to make computer-generated speech indistinguishable from human speech with Tacotron 2, a system that trains neural networks to generate eerily natural-sounding speech from text, and they have the samples to prove it.