 gan-tts




We thank all the reviewers for their valuable comments

Neural Information Processing Systems

We thank all the reviewers for their valuable comments. We would like to clarify that, 'When the model was trained without the mel-spectrogram loss, the training process [...]' We also think that applying the L1/L2 loss brings no disadvantage in a one-to-one mapping setting such as ours. We will clarify the details of the experiments in Section 3. Table 1: Mean Opinion Scores. All models were trained up to 500k steps. MOS evaluation results are shown in [Table 1].




A Spectral Energy Distance for Parallel Speech Synthesis

arXiv.org Machine Learning

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
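As a rough illustration of the idea in this abstract, the sketch below estimates a generalized energy distance between real and generated audio on magnitude spectrograms, using one minibatch and no discriminator. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the Hann-windowed STFT settings, and the plain Euclidean spectrogram distance are all illustrative choices, and the paper's exact distance function is not given in the abstract and is not reproduced here. The estimate drops the real-versus-real term, which does not depend on the generator.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=64):
    """Magnitude STFT with a Hann window; shape (frames, frame_len // 2 + 1).

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_distance(a, b):
    """Euclidean distance between magnitude spectrograms (assumed for illustration)."""
    return np.linalg.norm(magnitude_spectrogram(a) - magnitude_spectrogram(b))

def energy_distance_loss(real, fake_1, fake_2):
    """Minibatch estimate of the generator-dependent part of an energy distance.

    real, fake_1, fake_2: arrays of shape (batch, samples). fake_1 and fake_2
    are two independent model samples for the same conditioning input.
    Estimates 2 E d(x, y) - E d(y, y'), averaged over the batch.
    """
    total = 0.0
    for x, y1, y2 in zip(real, fake_1, fake_2):
        attract = spectral_distance(x, y1) + spectral_distance(x, y2)
        repel = spectral_distance(y1, y2)
        total += attract - repel
    return total / len(real)
```

The attractive term pulls generated spectrograms toward the real one, while the repulsive term between the two independent samples discourages the model from collapsing to a single output. Because each term is a simple average over independent draws, the minibatch estimate is unbiased for the corresponding expectation.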


DeepMind Generates High Fidelity Speech With GAN-TTS

#artificialintelligence

GANs have achieved state-of-the-art results in image and video generation, and have been successfully applied to unsupervised feature learning, among many other applications. Generative adversarial networks have seen rapid development in recent years; however, their audio generation prowess has largely gone unnoticed. In an attempt to explore the audio generation abilities of GANs, a team of DeepMind researchers published a paper introducing a new model called GAN-TTS. Text-to-Speech (TTS) is the process of converting text into humanlike voice output. Many audio generation models operate in the waveform domain.


Google's highly scalable AI can generate convincingly humanlike speech

#artificialintelligence

A generative adversarial network (GAN) is a versatile AI architecture type that's exceptionally well-suited to synthesizing images, videos, and text from limited data. But it has not been widely applied to the audio production domain owing to a number of design challenges, which is why Google and Imperial College London researchers set out to create a GAN-based text-to-speech system capable of matching (or besting) state-of-the-art methods. They say that their model not only generates high-fidelity speech with "naturalness" but is also highly parallelizable, meaning it's more easily trained across multiple machines compared with conventional alternatives. "A notable limitation of [state-of-the-art TTS] models is that they are difficult to parallelize over time: they predict each time step of an audio signal in sequence, which is computationally expensive and often impractical," wrote the coauthors. "A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel. An alternative approach for parallel waveform generation would be to use generative adversarial networks … To the best of our knowledge, GANs have not yet been applied at large scale to non-visual domains."