DeepMind Uses GANs to Convert Text to Speech

#artificialintelligence

Generative adversarial networks (GANs) have revolutionized high-fidelity image generation, making global headlines with their hyperrealistic portraits and content-swapping, while also raising concerns over convincing deepfake videos. Now, DeepMind researchers are extending GANs to audio with a new adversarial approach to high-fidelity speech synthesis. Text-to-speech (TTS) is the process of converting text into humanlike voice output. One of the most widely used TTS architectures is WaveNet, a neural autoregressive model that generates raw audio waveforms. But because WaveNet generates one audio sample at a time in sequence, it is poorly suited to today's massively parallel hardware.
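
The bottleneck the article describes is easy to see in code. Below is a toy sketch of WaveNet-style autoregressive generation, assuming a hypothetical `model` callable that predicts the next sample from a window of past samples; each of the `length` iterations depends on the previous one, so they cannot run in parallel:

```python
import torch

def generate_autoregressive(model, length, receptive_field=1024):
    """Toy WaveNet-style loop: one sample per iteration.

    `model` (hypothetical) maps a window of past samples to a
    single next sample. Because iteration t needs the output of
    iteration t-1, the loop is inherently sequential.
    """
    history = torch.zeros(1, 1, receptive_field)  # zero-padded past
    samples = []
    for _ in range(length):
        next_sample = model(history[:, :, -receptive_field:])
        samples.append(next_sample.flatten())
        history = torch.cat([history, next_sample.view(1, 1, 1)], dim=2)
    return torch.cat(samples)
```

At 24 kHz output, one second of audio costs 24,000 sequential network evaluations, which is why this approach wastes massively parallel hardware.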


DeepMind Generates High Fidelity Speech With GAN-TTS

#artificialintelligence

GANs have achieved state-of-the-art results in image and video generation and have been successfully applied to unsupervised feature learning, among many other applications. Generative adversarial networks have developed rapidly in recent years; however, their audio generation abilities have largely gone unexplored. To probe those abilities, a team of DeepMind researchers published a paper introducing a new model called GAN-TTS. Text-to-speech (TTS) is the process of converting text into humanlike voice output. Many audio generation models operate in the waveform domain.
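
Per the GAN-TTS paper, the generator is a feed-forward convolutional network that maps linguistic conditioning features (at 200 Hz) plus a latent noise vector to a raw 24 kHz waveform in a single pass. The sketch below captures that shape only; the channel widths, kernel sizes, and three-stage 120x upsampling are illustrative stand-ins, not the published architecture:

```python
import torch
import torch.nn as nn

class ToyFeedForwardVocoder(nn.Module):
    """Feed-forward generator: conditioning features -> waveform.

    Transposed convolutions upsample 200 Hz features by 8*5*3 = 120x
    to 24 kHz audio; every output sample is produced in one parallel
    forward pass (contrast with the autoregressive loop above).
    """
    def __init__(self, feat_dim=128, noise_dim=128):
        super().__init__()
        self.inp = nn.Conv1d(feat_dim + noise_dim, 256, kernel_size=3, padding=1)
        self.up = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=15, stride=5, padding=5),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=9, stride=3, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, feats, noise):
        # broadcast one noise vector across all conditioning frames
        z = noise.unsqueeze(-1).expand(-1, -1, feats.size(-1))
        return self.up(self.inp(torch.cat([feats, z], dim=1)))

g = ToyFeedForwardVocoder()
feats = torch.randn(1, 128, 40)        # 40 frames = 0.2 s at 200 Hz
audio = g(feats, torch.randn(1, 128))  # -> (1, 1, 4800) audio samples
```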


Google's highly scalable AI can generate convincingly humanlike speech

#artificialintelligence

A generative adversarial network (GAN) is a versatile AI architecture that is exceptionally well suited to synthesizing images, videos, and text from limited data. But it has not been widely applied to the audio domain, owing to a number of design challenges, which is why Google and Imperial College London researchers set out to create a GAN-based text-to-speech system capable of matching (or besting) state-of-the-art methods. They say their model not only generates high-fidelity speech with "naturalness" but is also highly parallelizable, meaning it can be trained across multiple machines more easily than conventional alternatives. "A notable limitation of [state-of-the-art TTS] models is that they are difficult to parallelize over time: they predict each time step of an audio signal in sequence, which is computationally expensive and often impractical," wrote the coauthors. "A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel. An alternative approach for parallel waveform generation would be to use generative adversarial networks … To the best of our knowledge, GANs have not yet been applied at large scale to non-visual domains."
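
The adversarial training alluded to above pits the parallel generator against a discriminator that scores real versus generated audio. Here is a generic sketch of one such update using the hinge loss common in audio GANs; `G`, `D`, and the optimizers are assumed to exist, and this is not the paper's exact recipe (which uses an ensemble of random-window discriminators):

```python
import torch.nn.functional as F

def adversarial_step(G, D, feats, noise, real_audio, opt_g, opt_d):
    """One GAN update with hinge losses (generic sketch)."""
    fake = G(feats, noise)

    # Discriminator: push scores on real audio above +1, fake below -1.
    opt_d.zero_grad()
    d_loss = (F.relu(1.0 - D(real_audio)).mean()
              + F.relu(1.0 + D(fake.detach())).mean())
    d_loss.backward()
    opt_d.step()

    # Generator: raise the discriminator's score on generated audio.
    opt_g.zero_grad()
    g_loss = -D(fake).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```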


Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

arXiv.org Machine Learning

The state of the art in text-to-speech synthesis has recently improved considerably thanks to novel neural waveform generation methods such as WaveNet. However, these methods suffer from a slow, sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their attractive properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GANs still falls well short of WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
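
The glottal excitation result rests on the classical source-filter model of speech: the GAN only has to generate the comparatively simple glottal source signal, which is then shaped into speech by an all-pole vocal tract filter estimated from acoustic features. A rough sketch of that final filtering step, for a single frame with a made-up filter (real systems filter frame by frame with coefficients from an acoustic model):

```python
import numpy as np
from scipy.signal import lfilter

def source_filter_synthesis(excitation, lpc_coeffs):
    """Pass a generated glottal excitation through an all-pole
    vocal tract filter 1 / A(z) to obtain the speech waveform."""
    return lfilter([1.0], lpc_coeffs, excitation)

# Toy usage: white noise stands in for the GAN-generated excitation,
# and a one-pole filter stands in for real per-frame LPC coefficients.
excitation = np.random.randn(16000)  # 1 s at 16 kHz
a = np.array([1.0, -0.9])
speech = source_filter_synthesis(excitation, a)
```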


6 Ways Speech Synthesis Is Being Powered By Deep Learning

#artificialintelligence

This model was open-sourced in June 2019 as an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. This service is offered by Resemble.ai; with this product, one can clone any voice and create dynamic, iterable, and unique voice content. Users input a short voice sample, and the model, trained only during playback time, can immediately deliver text-to-speech utterances in the style of the sampled voice. Bengaluru's Deepsync offers an Augmented Intelligence that learns the way you speak.
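
For context on how a model like this clones a voice from a short sample without retraining: the Transfer Learning from Speaker Verification paper splits the job into three separately trained networks, and cloning is just a forward pass through each. A schematic sketch with hypothetical module names:

```python
import torch

def clone_voice(text, reference_wav, encoder, synthesizer, vocoder):
    """SV2TTS-style pipeline (schematic; the three callables are
    placeholders for separately trained networks).

    1. encoder:     reference audio -> fixed-size speaker embedding
    2. synthesizer: (text, embedding) -> mel spectrogram
    3. vocoder:     mel spectrogram -> waveform
    """
    with torch.no_grad():                   # inference only, no retraining
        embedding = encoder(reference_wav)  # e.g. a 256-d speaker vector
        mel = synthesizer(text, embedding)
        return vocoder(mel)
```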