In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system converting neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test show the efficiency of the data which is promising for future work.
This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial network -based noise model to add a realistic high-frequency stochastic component to the modeled excitation signal. The results show that high quality speech reconstruction can be obtained, given only MFCC information at test time.
This model was open sourced back in June 2019 as an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. This service is being offered by Resemble.ai. With this product, one can clone any voice and create dynamic, iterable, and unique voice content. Users input a short voice sample and the model -- trained only during playback time -- can immediately deliver text-to-speech utterances in the style of the sampled voice. Bengaluru's Deepsync offers an Augmented Intelligence that learns the way you speak.
Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches by means of a large scale crowdsourced evaluation. Results on acoustic models showed that generative adversarial networks and an autoregressive (AR) model performed better than a normal recurrent network and the AR model performed best. Evaluation on vocoders by using the same AR acoustic model demonstrated that a Wavenet vocoder outperformed classical source-filter-based vocoders. Particularly, generated speech waveforms from the combination of AR acoustic model and Wavenet vocoder achieved a similar score of speech quality to vocoded speech.
Being an iOS user, how many times do you talk to Siri in a day? If you are a keen observer, then you know that Siri's voice sounds much more like a human in iOS 11 than it has before. This is because Apple is digging deeper into the technology of artificial intelligence, machine learning, and deep learning to offer the best personal assistant experience to its users. From the introduction of Siri with the iPhone 4S to its continuation in iOS 11, this personal assistant has evolved to get closer to humans and establish good relations with them. To reply to voice commands of users, Siri uses speech synthesis combined with deep learning.