Abstract: In this work, we propose a non-autoregressive sequence-to-sequence model that converts text to spectrogram. It is fully convolutional and achieves about 17.5 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on the challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass.
May-22-2019, 03:02:17 GMT