Google's secretive DeepMind AI is analysing human speech to allow it to converse

Daily Mail - Science & tech

Google's secretive British DeepMind division is teaching its AI to talk like a human. The groundbreaking project has already halved the quality gap between computer systems and human speech, its creators say. Called WaveNet, it is capable of creating natural-sounding synthesized speech by analyzing sound waves from the human voice - rather than focusing on the human language. Google's DeepMind claims to have an AI that produces more natural-sounding synthesized speech. Google acquired UK-based DeepMind in 2014 for 533 million, and it has since beaten a professional human Go player, learned how to play the Atari game Space Invaders and read through thousands of Daily Mail and CNN articles.

Google DeepMind gets closer to sounding human


Artificial intelligence researchers at DeepMind have used neural networks to create some of the most realistic-sounding, human-like speech to date. Dubbed WaveNet, the AI promises significant improvements to computer-generated speech, and could eventually be used in digital personal assistants such as Siri, Cortana and Amazon's Alexa. The technology generates voices by sampling real human speech from both English and Mandarin speakers. In tests, the WaveNet-generated speech was found to be more realistic than other forms of text-to-speech programs, but it still fell short of being truly convincing. In 500 blind tests, respondents were asked to judge sample sentences on a scale of one to five (five being most realistic).

And So It Begins: Google DeepMind AI Learns How To Talk Like Humans


Google has reached a milestone in its DeepMind artificial intelligence (A.I.) project with the successful development of technology that can mimic the sound of the human voice. Dubbed WaveNet, the breakthrough was described as a deep neural network that generates raw audio waveforms to produce speech. It can reportedly beat existing text-to-speech systems. According to researchers in the Britain-based WaveNet unit, the gap with human performance, which could be demonstrated in an actual A.I.-human conversation, is reduced by as much as 50 percent. What is also interesting about the WaveNet technology is that it is capable of learning different voices and speech patterns to the point that it can even simulate mouth movements and artificial breaths in addition to emotions, language inflections and accents.

Google's DeepMind claims major milestone in making machines talk like humans - ZDNet


On a scale from 1 to 5, WaveNet's quality of voice outstrips Google's current best parametric and concatenative systems. Google's UK artificial intelligence lab, DeepMind, has developed a deep neural network that produces more human-like speech than Google's previous text-to-speech (TTS) systems. DeepMind has published a new paper describing WaveNet, a convolutional neural network it says has closed the gap between machine-generated and human speech by 50 percent in both US English and Mandarin Chinese. Not only this, but the network can also seamlessly switch between different voices and generate realistic music fragments. The researchers note that today's best TTS systems, generally considered to be powered by Google, are built on "speech fragments" recorded from a single speaker.
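To make the "closed the gap by 50 percent" claim concrete, here is a minimal sketch of one plausible way to read it: the improvement of the new system over the previous best, divided by the distance between the previous best and human speech, all on the same 1-to-5 rating scale. The function name `gap_closed` and the scores in the usage example are hypothetical, chosen only to illustrate the arithmetic. They are not figures from the DeepMind paper.

```python
def gap_closed(previous_best, new_system, human):
    """Fraction of the machine-to-human quality gap closed by a new system.

    All three arguments are average listener ratings on the same scale
    (for example 1 to 5). This formula is an assumed reading of the
    "gap closed" claim, not a definition taken from the article.
    """
    gap = human - previous_best
    if gap <= 0:
        raise ValueError("human score must exceed the previous best score")
    return (new_system - previous_best) / gap


# Hypothetical ratings, for illustration only (not measured results).
example = gap_closed(previous_best=3.0, new_system=4.0, human=5.0)
print(f"Gap closed: {example:.0%}")
```

With these made-up ratings, the new system sits exactly halfway between the old system and human speech, which is what a 50 percent reduction in the gap would mean under this reading.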

Google's DeepMind Claims Massive Progress in Synthesized Speech


Researchers at Google's DeepMind artificial intelligence division claim to have come up with a way of producing much more natural-sounding synthesized speech, compared with the techniques that are currently in use. Existing text-to-speech (TTS) systems tend to use a system called concatenative TTS, where the audio is generated by recombining fragments of recorded speech. There's also a technique called parametric TTS that generates speech by passing information through a vocoder, but that sounds even less natural. So DeepMind has come up with a new technique called WaveNet that learns from the audio it's fed, and produces raw audio sample-by-sample. To give an idea of how detailed that is, we're talking at least 16,000 samples per second.
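To give a feel for what "sample-by-sample" means, the sketch below generates audio one value at a time, feeding each new sample back in as input for the next. It is a toy under stated assumptions, not WaveNet itself: the function `toy_predict_next` is a hypothetical stand-in for the trained neural network and simply continues a pure 440 Hz tone from the two previous samples. Only the 16,000 samples-per-second figure comes from the article.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article
TONE_HZ = 440.0       # assumed test tone, chosen only for illustration


def toy_predict_next(history):
    """Hypothetical stand-in for a trained network.

    Predicts the next sample from the two previous ones using the
    sine-wave recurrence x[t] = 2*cos(w)*x[t-1] - x[t-2].
    """
    w = 2.0 * math.pi * TONE_HZ / SAMPLE_RATE
    return 2.0 * math.cos(w) * history[-1] - history[-2]


def generate(seconds, predict_next=toy_predict_next):
    """Generate audio one sample at a time, feeding each output back in."""
    w = 2.0 * math.pi * TONE_HZ / SAMPLE_RATE
    samples = [0.0, math.sin(w)]  # two seed samples to start the loop
    total = int(seconds * SAMPLE_RATE)
    while len(samples) < total:
        # Each new sample depends on the samples generated before it.
        samples.append(predict_next(samples))
    return samples[:total]


audio = generate(1.0)
print(f"{len(audio)} sequential predictions for one second of audio")
```

Even in this toy version, one second of sound takes 16,000 predictions made strictly in order, which hints at why generating raw audio this way is so demanding.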