Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
Considered the first electrical speech synthesizer, VODER (Voice Operation DEmonstratoR) was developed by Homer Dudley at Bell Labs and demonstrated at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition. Although difficult to operate, VODER nonetheless paved the way for future machine-generated speech.
The work will have a particular focus on the development of structured acoustic models which take account of factors such as accent and speaking style, and on the development of machine learning techniques for vocoding. You will have the necessary programming ability to conduct research in this area, a background in statistical modeling using Hidden Markov Models, DNNs and RNNs, a background in speech signal processing, and research experience in speech synthesis. A background in one or more of the following areas is also desirable: statistical parametric text-to-speech synthesis using HMMs and HSMMs; glottal source modeling; speech signal modeling; speaker adaptation using the MLLR or MAP family of techniques; familiarity with software tools including DNN, Deep Learning, RNN, HTK, HTS, Festival; and familiarity with modern machine learning. Develop and extend speech synthesis technologies in Oben's proprietary speech synthesis system, with a view to realizing prosody and voice quality modifications; develop and apply algorithms to annotate prosody and voice quality in expressive speech synthesis corpora; and carry out a listener evaluation study of expressive synthetic speech.
I have to warn you that I haven't had much success in generating fine samples, although the source code itself is complete. I've tried to find what's wrong, but I have now decided to open the current code to everyone, because I know many people are working on this project and my work might help them.
Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, and without any previous guidance. "Deep voice 2 can learn from hundreds of voices and imitate them perfectly," a blog post says. In a research paper (PDF), Baidu concludes that its neural network can create voice pretty effectively even from small voice samples from hundreds of different speakers.
TL;DR Baidu's TTS system now supports multi-speaker conditioning, and can learn new speakers with very little data (a la LyreBird). I'm really excited about the recent influx of neural-net TTS systems, but all of them seem to be too slow for real-time dialog, or not publicly available, or both. Hoping that one of them gets a high-quality open-source implementation soon!
Next time you hear a voice generated by Baidu's Deep Voice 2, you might not be able to tell whether it's human. That's leaps and bounds better than early versions of Deep Voice, which took multiple hours to learn one voice. Then, it autonomously derives unique voices from that model -- unlike voice assistants like Apple's Siri, which require that a human record thousands of hours of speech that engineers tune by hand, Deep Voice 2 doesn't require guidance or manual intervention. Google's WaveNet, a product of the company's DeepMind division, generates voices by sampling real human speech and independently creating its own sounds in a variety of voices.
Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, but add some smartly placed background noise to cover up the distortion and the recordings could pass as genuine to unsuspecting ears. AI-based personal assistants like Siri and Cortana rely on speech synthesizers to create a more natural interface with users, while audiobook companies may one day utilize the technology to automatically and cheaply generate products. "We want to improve human-computer interfaces and create completely new applications for speech synthesis," explains de Brébisson to Singularity Hub. That's because different voices share a lot of similar information that is already "stored" within the artificial network, explains de Brébisson.
Using a powerful new algorithm, a Montreal-based AI startup has developed a voice generator that can mimic virtually any person's voice, and even add an emotional punch when necessary. "We train our models on a huge dataset with thousands of speakers," Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. Eventually, a refined version of this system could replicate a person's voice with incredible accuracy, making it virtually impossible for a human listener to discern the original from the emulation. It will be a long, long time before a speech synthesis program can replicate every single aspect of a person's distinctive speech, like the finer details of vocal timbre (i.e.
Amazon Polly provides speech synthesis functionality that overcomes those challenges, allowing you to focus on building applications that use text-to-speech instead of addressing interpretation challenges. The application provides two methods: one for sending information about a new post, which should be converted into an MP3 file, and one for retrieving information about the post (including a link to the MP3 file stored in an S3 bucket). Now let's create the Lambda function, "Convert to Audio," that converts text stored in a DynamoDB table into an audio file. From the API Gateway console, we choose the Create API option.
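The core of a "Convert to Audio" step like the one described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the application's actual code: the helper names `split_text` and `convert_to_audio`, the 2,500-character chunk size, the `Joanna` voice, and the `<post_id>.mp3` key scheme are all illustrative choices. The only AWS calls used are Polly's `synthesize_speech` and S3's `put_object`, and the clients are passed in as arguments so the logic can be exercised without AWS credentials.

```python
import io


def split_text(text, limit=2500):
    """Split text into chunks of at most `limit` characters.

    Polly limits the amount of text per synthesize_speech request, so long
    posts are sent in pieces. The default of 2500 is an assumed safety
    margin, not a value taken from the source.
    """
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        # Prefer to break at the last space that keeps the chunk within limit.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space found: fall back to a hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks


def convert_to_audio(post_id, text, polly, s3, bucket, voice_id="Joanna"):
    """Synthesize `text` to MP3 with Polly and store the result in S3.

    `polly` and `s3` are client objects (for example boto3 clients).
    Returns the S3 key of the stored file (hypothetical naming scheme).
    """
    audio = io.BytesIO()
    for chunk in split_text(text):
        response = polly.synthesize_speech(
            Text=chunk, OutputFormat="mp3", VoiceId=voice_id
        )
        # Naively append each chunk's MP3 bytes to a single buffer.
        audio.write(response["AudioStream"].read())
    key = f"{post_id}.mp3"
    s3.put_object(
        Bucket=bucket, Key=key, Body=audio.getvalue(), ContentType="audio/mpeg"
    )
    return key
```

Inside a Lambda function, the clients would typically be created with `boto3.client("polly")` and `boto3.client("s3")`, and the post text would be read from the DynamoDB table before calling `convert_to_audio`. Passing the clients in, rather than creating them inside the helper, is a design choice made here so that the function can be tested with stand-in objects.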
The possibilities include good old-fashioned cassette tape recorders, specialised talking book readers such as the Victor Reader Stream, CD players, MP3 players, smartphones, tablets and PCs. This includes a 7th-generation Kindle ebook reader, a small external Kindle Audio Adapter, and VoiceView for Kindle software. There are also TTS apps for smartphones and tablets, including Voice Dream Reader for Apple and Android. For example, the new Victor Reader Stream plays Audible books while also including Acapela's TTS software, which can voice text files and ebooks in the ePub format.