We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread. The proposed model is a carefully tuned lightweight network designed by applying multiple network-reduction and fidelity-improvement techniques. In addition, we propose a novel group highway activation that trades off computational efficiency against the regularization effect of the gating mechanism. We also introduce a new metric, elastic mel-cepstral distortion (EMCD), to measure the fidelity of the output mel-spectrogram. In experiments, we analyze the effect of the acceleration techniques on speed and speech quality. Compared with the baseline model, the proposed model improves MOS from 2.62 to 2.74 while using only 1.76% of the computation and 2.75% of the parameters, and it runs 7.45 times faster on a single CPU thread, fast enough to produce mel-spectrograms in real time without a GPU.
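The abstract above does not give the exact formulation of the group highway activation, but the idea of cheapening a highway layer's gate can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: a standard highway layer computes one sigmoid gate per channel, while here the gate is computed once per channel *group* and broadcast, shrinking the gate projection by a factor of `d / groups`.

```python
import numpy as np

def group_highway(x, W_h, b_h, W_g, b_g, groups):
    """Hypothetical sketch of a group highway activation.

    A standard highway layer computes a per-channel gate
    g = sigmoid(x @ W_g + b_g) and returns g * H(x) + (1 - g) * x.
    Here W_g maps d channels to only `groups` gate values, and each
    gate is shared by d // groups consecutive channels.
    """
    d = x.shape[-1]
    assert d % groups == 0, "channel count must be divisible by groups"
    h = np.tanh(x @ W_h + b_h)                    # transform path H(x), shape (..., d)
    g = 1.0 / (1.0 + np.exp(-(x @ W_g + b_g)))   # one gate per group, shape (..., groups)
    g = np.repeat(g, d // groups, axis=-1)       # broadcast each gate over its channel group
    return g * h + (1.0 - g) * x                 # gated mix of transform and identity paths

# Toy usage: 8 channels gated in 2 groups instead of 8 individual gates.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W_h, b_h = rng.standard_normal((8, 8)), np.zeros(8)
W_g, b_g = rng.standard_normal((8, 2)), np.zeros(2)
y = group_highway(x, W_h, b_h, W_g, b_g, groups=2)
```

With `groups` equal to the channel count this reduces to an ordinary highway layer; with `groups=1` a single scalar gate is shared by all channels, which is the cheapest (and least expressive) extreme of the trade-off.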
TikTok's text-to-speech feature lets creators put text over their videos and have a Siri-like voice read it out loud. It's a handy way to annotate your videos to describe what's happening, add context, or serve whatever purpose you see fit. There's also no rule saying you can't use it just to make the text-to-speech voice say silly things. Here's how to easily add text-to-speech to your TikTok videos. You can cancel it, edit the text, or adjust its duration just by tapping the text again. Once you're happy with your video, just tap "Next," apply whatever hashtags you want, and post!
Abstract: In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about a 17.5 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass.
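The reason an IAF vocoder needs only a single feed-forward pass can be sketched in a few lines. This toy example is an assumption-laden illustration (the names `iaf_step` and `toy_conditioner` are invented here, and a real vocoder would use a masked neural network as the conditioner): because each timestep's shift and scale depend only on the *noise* at earlier positions, not on earlier outputs, all timesteps can be transformed at once.

```python
import numpy as np

def iaf_step(z, make_shift_scale):
    """Hypothetical sketch of one inverse autoregressive flow (IAF) step.

    `make_shift_scale` is assumed to be causal: the statistics for
    position t may depend only on z[:t]. Given that, sampling is fully
    parallel: compute shift/scale for every timestep in one pass, then
    apply them elementwise. No sequential sample-by-sample loop is needed.
    """
    shift, scale = make_shift_scale(z)
    return z * scale + shift

def toy_conditioner(z):
    """Stand-in for a masked network: each position's shift depends only
    on the sum of strictly earlier noise samples."""
    prefix = np.concatenate([[0.0], np.cumsum(z)[:-1]])
    return 0.1 * prefix, np.ones_like(z)

z = np.random.randn(16)           # draw noise for all timesteps at once
x = iaf_step(z, toy_conditioner)  # one parallel pass over the whole sequence
```

Contrast this with an autoregressive vocoder like WaveNet, where sample t cannot be computed until sample t-1 exists, forcing a loop of length equal to the waveform.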
Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. The latest news about the tech is a set of audio samples showcasing its ability to accurately portray differences in regional accents. The company says that the new version, aptly named Deep Voice 2, has been able to "learn from hundreds of unique voices from less than a half an hour of data per speaker, while achieving high audio quality." That's compared to the 20 hours of training it took to get similar results from the previous iteration, for a single voice, further pushing its efficiency past Google's WaveNet in a few months' time. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, without any prior guidance.