Speech Synthesis


Speaker-independent raw waveform model for glottal excitation

arXiv.org Machine Learning

Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker 'GlotNet' vocoder, which utilizes a WaveNet to generate glottal excitation waveforms, which are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model performs favourably to a direct WaveNet vocoder trained with the same model architecture and data.
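
The source-filter idea in the abstract above can be illustrated with a minimal sketch. It is not the GlotNet implementation: a plain impulse train stands in for the WaveNet-generated glottal excitation, and a single two-pole resonator stands in for the vocal tract filter. All function names and numeric values (sample rate, pitch period, formant frequency and bandwidth) are assumptions chosen for illustration.

```python
import math

def impulse_train(n_samples, period):
    """Crude stand-in for a glottal excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients [a1, a2] of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius, < 1 so the filter is stable
    theta = 2.0 * math.pi * freq_hz / sample_rate         # pole angle
    return [-2.0 * r * math.cos(theta), r * r]

def all_pole_filter(excitation, a):
    """All-pole filter: y[n] = x[n] - sum_k a[k-1] * y[n-k] for k = 1..len(a)."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# Hypothetical settings: 16 kHz audio, 100 Hz pitch, one formant near 500 Hz.
sample_rate = 16000
excitation = impulse_train(n_samples=800, period=160)
vocal_tract = resonator_coeffs(500.0, 100.0, sample_rate)
speech = all_pole_filter(excitation, vocal_tract)
```

In a vocoder of the kind the abstract describes, the excitation would come from the trained waveform model and the filter coefficients from the acoustic features; only the final filtering step resembles this sketch.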


Google's new text-to-speech service has more realistic voices

Engadget

Google will now let developers use the text-to-speech synthesis that powers the voices in Google Assistant and Maps. Cloud Text-to-Speech is available now through the Google Cloud Platform and the company says it can be used to power voice response systems in call centers, enable IoT device speech and convert media like news articles and books into a spoken format. There are 32 different voice options in 12 languages and users can customize pitch, speaking rate and volume gain. Additionally, a selection of the available voices were built with Google's WaveNet model. It was developed by Google's DeepMind team and the company first announced it in 2016.
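
As a rough illustration of the customization options mentioned above, the sketch below builds a JSON request body of the kind the Cloud Text-to-Speech REST API accepts. The helper name `build_synthesis_request` is hypothetical, the field names (`input`, `voice`, `audioConfig`, `pitch`, `speakingRate`, `volumeGainDb`) are believed to match the public API, and the numeric ranges are assumptions that should be checked against the current documentation. The code sends nothing over the network.

```python
import json

# Assumed valid ranges; verify against the current API documentation.
PITCH_RANGE = (-20.0, 20.0)          # semitones
SPEAKING_RATE_RANGE = (0.25, 4.0)    # 1.0 is normal speed
VOLUME_GAIN_DB_RANGE = (-96.0, 16.0)

def _check(name, value, bounds):
    """Raise ValueError if `value` falls outside the inclusive `bounds`."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside [{low}, {high}]")
    return value

def build_synthesis_request(text, language_code="en-US",
                            pitch=0.0, speaking_rate=1.0, volume_gain_db=0.0):
    """Hypothetical helper: assemble a synthesis request body as a dict."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": "MP3",
            "pitch": _check("pitch", pitch, PITCH_RANGE),
            "speakingRate": _check("speaking_rate", speaking_rate, SPEAKING_RATE_RANGE),
            "volumeGainDb": _check("volume_gain_db", volume_gain_db, VOLUME_GAIN_DB_RANGE),
        },
    }

request_body = build_synthesis_request("Hello, world", pitch=2.0, speaking_rate=1.1)
request_json = json.dumps(request_body)
```

Validating the values locally is a design choice of this sketch; the service performs its own validation, and its limits are the authoritative ones.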


Google Cloud Platform launches text-to-speech service to compete with AWS Polly

ZDNet

Google Cloud outlined Cloud Text-to-Speech, a machine learning service that uses a model by Google's DeepMind subsidiary to analyze raw audio. With the move, developers will get more access to the natural-sounding text-to-speech technology used in Google Assistant, Search, Maps and others. According to Google, Cloud Text-to-Speech can be used to power call center voice response systems, enable Internet of Things devices to talk and convert text-based media into spoken formats. Google Cloud Text-to-Speech allows customers to choose from 32 different voices in 12 languages.


Meet the man whose voice became Stephen Hawking's

Mashable

Stephen Hawking's computer-generated voice is so iconic that it's trademarked. The filmmakers behind The Theory of Everything had to get Hawking's personal permission to use the voice in his biopic. But that voice has an interesting origin story of its own. Back in the '80s, when Hawking was first exploring text-to-speech communication options after he lost the power of speech, a pioneer in computer-generated speech algorithms was working at MIT on that very thing. His name was Dennis Klatt. As Wired uncovered, Klatt's work was incorporated into one of the first devices that translated text into speech: the DECtalk.


Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

arXiv.org Artificial Intelligence

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.


Google Creates A Text To Speech AI system Alike Human voice

#artificialintelligence

Google has taken a big step towards its 'AI-first' dream. The tech giant has set out to develop a text-to-speech system with closely human-like articulation. The AI system, called "Tacotron 2", can generate computer speech in a human-like voice. Google researchers mentioned in a blog post that the new approach does not use complex linguistic and acoustic features as input. Instead, it generates human-like speech from text using neural networks trained only on speech examples and the corresponding text transcripts.


Google's New Text-to-Speech AI Is so Good We Bet You Can't Tell It From a Real Human

#artificialintelligence

Can you tell the difference between AI-generated computer speech and a real, live human being? Maybe you've always thought you could. Maybe you're fond of Alexa and Siri but believe you would never confuse either of them with an actual woman.


Google develops human-like text-to-speech artificial intelligence system

#artificialintelligence

In a major step towards its "AI first" dream, Google has developed a text-to-speech artificial intelligence (AI) system that will confuse you with its human-like articulation.


Google's new text-to-speech system sounds convincingly human

#artificialintelligence

Get ready for the little person living inside your phone and speaker to sound a lot more life-like. Google believes it has reached a new milestone in the quest to make computer-generated speech indistinguishable from human speech with Tacotron 2, a system that trains neural networks to generate eerily natural-sounding speech from text, and they have the samples to prove it.