A word of warning: although the source code itself is complete, I haven't had much success generating good samples. I've tried to find out what's wrong, but I've decided to release the current code anyway, because I know many people are working on this project and my work might help them.
When machines speak, they sound stilted, robotic and mechanical – but they're getting better. Google's latest text-to-speech system, called Tacotron 2, generates sounds entirely from scratch, and the search giant claims the results are as good as those built using professional voice artists. Previous systems normally produce speech by assembling human-recorded vocal sounds into words and sentences. In comparison, Tacotron 2 was trained on over 24 hours of human speech and corresponding transcripts, and could then generate completely new audio of phrases from a given text even if it had never seen some of the words before. You can listen to the results here.
Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
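To make the "mel-scale spectrogram" in the abstract concrete, here is a minimal NumPy sketch of how such a feature could be computed from a waveform. It is not the code in this repository and not the paper's exact pipeline: the function names, the HTK-style mel formula, the triangular filters, and the log floor of 1e-5 are all assumptions chosen for illustration, and the parameter values in the usage note are illustrative too.

```python
import numpy as np


def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # n_mels + 2 points equally spaced on the mel axis give the
    # left edge, centre, and right edge of each triangular filter.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(wave, sr, n_fft, hop, n_mels, fmin, fmax):
    # Short-time Fourier transform with a Hann window, magnitude only.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    # Project linear-frequency bins onto the mel filters, then compress
    # with a log; the 1e-5 floor only avoids log(0).
    mel = mag @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T
    return np.log(np.maximum(mel, 1e-5))
```

For example, calling `log_mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80, fmin=125.0, fmax=7600.0)` on a one-second signal returns an array with one row per frame and 80 mel channels per row. In the architecture the abstract describes, frames of this kind are what the feature prediction network outputs and what the WaveNet vocoder is conditioned on.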
Creating convincing artificial speech is a hot pursuit right now, with Google arguably in the lead. The company may have leapt ahead again with the announcement today of Tacotron 2, a new method for training a neural network to produce realistic speech from text that requires almost no grammatical expertise.