Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS

Liang, Ziqi

arXiv.org Artificial Intelligence 

Recurrent neural networks (RNNs) have become a standard modeling technique for sequential data and are used in novel text-to-speech (TTS) models. However, training a TTS model that includes RNN components requires powerful GPUs and takes a long time. In contrast, CNN-based sequence synthesis techniques can significantly reduce the training time of a text-to-speech model while guaranteeing comparable performance, owing to their high parallelism. We propose a novel text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units) and is trained end-to-end in two stages. To address the low-resource problem of scarce labeled data, we also improve the robustness of our model with a series of data augmentation methods, such as time warping, frequency masking, and time masking. The final experimental results show that a TTS model using only CNN components can reduce training time while matching the quality and naturalness of speech synthesized by mainstream TTS models such as Tacotron2 with the HiFi-GAN vocoder.
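The frequency- and time-masking augmentations mentioned above can be illustrated with a short sketch. This is not the paper's implementation; it is a minimal NumPy version of SpecAugment-style masking, assuming mel spectrograms shaped (n_mels, n_frames) and zero as the mask fill value (the function names and mask widths are illustrative choices, not taken from the paper):

```python
import numpy as np

def freq_mask(spec, max_width=8, rng=None):
    """Zero out a random contiguous band of mel channels (frequency masking)."""
    rng = rng or np.random.default_rng()
    n_mels = spec.shape[0]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_mels - width)))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

def time_mask(spec, max_width=20, rng=None):
    """Zero out a random contiguous span of time frames (time masking)."""
    rng = rng or np.random.default_rng()
    n_frames = spec.shape[1]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    out = spec.copy()
    out[:, start:start + width] = 0.0
    return out

# Example: augment a dummy 80-mel x 200-frame spectrogram.
spec = np.random.default_rng(0).random((80, 200)).astype(np.float32)
augmented = time_mask(freq_mask(spec))
print(augmented.shape)  # shape is unchanged; only values inside the masks are zeroed
```

Time warping (the third method named in the abstract) additionally stretches the spectrogram along the time axis and typically requires an interpolation step, so it is omitted here for brevity.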
