Diff-TTS: A Denoising Diffusion Model for Text-to-Speech
Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim
– arXiv.org Artificial Intelligence
Although neural text-to-speech (TTS) models have attracted much attention and succeeded in generating human-like speech, there is still room for improvement in their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram through a sequence of diffusion time steps. To learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost the inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS synthesizes speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
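The denoising diffusion framework the abstract refers to can be illustrated with the standard DDPM noise-prediction objective: corrupt the target mel-spectrogram with Gaussian noise at a random time step, then train a network to predict that noise given the noisy input, the step index, and the text condition. The sketch below is a minimal illustration under assumed names; `denoiser`, `text_emb`, `num_steps`, the linear beta schedule, and the MSE loss are all my assumptions for exposition, not the paper's actual code or hyperparameters.

```python
# Minimal DDPM-style training sketch for a text-conditioned
# mel-spectrogram diffusion model. All names and hyperparameters here
# are illustrative assumptions, not the Diff-TTS reference implementation.
import torch

num_steps = 400                                  # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, num_steps)    # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, mel, text_emb):
    """One training step: corrupt `mel` (batch, n_mels, frames) at a
    random time step t, then regress the injected noise."""
    b = mel.size(0)
    t = torch.randint(0, num_steps, (b,), device=mel.device)
    a_bar = alphas_cumprod.to(mel.device)[t].view(b, 1, 1)
    noise = torch.randn_like(mel)
    # Closed-form forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    noisy_mel = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    pred = denoiser(noisy_mel, t, text_emb)      # eps_theta(x_t, t, condition)
    return torch.mean((pred - noise) ** 2)
```

In this framework, the accelerated sampling mentioned in the abstract typically corresponds to traversing only a subsequence of the reverse time steps at inference, cutting the number of network evaluations at a small cost in perceptual quality.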
Apr-3-2021
- Country:
  - Asia > South Korea (0.15)
- Genre:
  - Research Report (0.50)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (1.00)
    - Representation & Reasoning (0.89)
    - Speech > Speech Synthesis (0.96)
    - Vision > Optical Character Recognition (0.63)