Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
I have to warn you that I haven't had much success in generating good samples, although the source code itself is complete. I tried to find what's wrong, but I have now decided to open the current code to everyone, because I know many people are working on this project and my work might help them.
Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, and without any previous guidance. "Deep Voice 2 can learn from hundreds of voices and imitate them perfectly," a blog post says. In a research paper (PDF), Baidu concludes that its neural network can create voices pretty effectively even from small voice samples from hundreds of different speakers.
TL;DR Baidu's TTS system now supports multi-speaker conditioning, and can learn new speakers with very little data (a la LyreBird). I'm really excited about the recent influx of neural-net TTS systems, but all of them seem to be too slow for real-time dialog, or not publicly available, or both. Hoping that one of them gets a high-quality open-source implementation soon!
Next time you hear a voice generated by Baidu's Deep Voice 2, you might not be able to tell whether it's human. That's leaps and bounds better than early versions of Deep Voice, which took multiple hours to learn one voice. Then, it autonomously derives unique voices from that model -- unlike voice assistants like Apple's Siri, which require that a human record thousands of hours of speech that engineers tune by hand, Deep Voice 2 doesn't require guidance or manual intervention. Google's WaveNet, a product of the company's DeepMind division, generates voices by sampling real human speech and independently creating its own sounds in a variety of voices.
While Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, add some smartly placed background noise to cover up the distortion, and the recordings could pass as genuine to unsuspecting ears. AI-based personal assistants like Siri and Cortana rely on speech synthesizers to create a more natural interface with users, while audiobook companies may one day utilize the technology to automatically and cheaply generate products. "We want to improve human-computer interfaces and create completely new applications for speech synthesis," explains de Brébisson to Singularity Hub. That's because different voices share a lot of similar information that is already "stored" within the artificial network, explains de Brébisson.
Using a powerful new algorithm, a Montreal-based AI startup has developed a voice generator that can mimic virtually any person's voice, and even add an emotional punch when necessary. "We train our models on a huge dataset with thousands of speakers," Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. Eventually, a refined version of this system could replicate a person's voice with incredible accuracy, making it virtually impossible for a human listener to discern the original from the emulation. It will be a long, long time before a speech synthesis program can replicate every single aspect of a person's distinctive speech, like the finer details of vocal timbre (i.e.
Amazon Polly provides speech synthesis functionality that overcomes those challenges, allowing you to focus on building applications that use text-to-speech instead of addressing interpretation challenges. The application provides two methods – one for sending information about a new post, which should be converted into an MP3 file, and one for retrieving information about the post (including a link to the MP3 file stored in an S3 bucket). Now let's create the Lambda function, "Convert to Audio," that converts text stored in a DynamoDB table into an audio file. From the API Gateway console, we choose the Create API option.
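The two methods and the conversion step described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names (new_post, get_post, convert_to_audio), the status values, the memory:// link format, and the in-memory dictionaries standing in for the DynamoDB table and the S3 bucket are all assumptions made for the example. Only polly_synthesize talks to AWS, using the boto3 synthesize_speech call, and it needs boto3 and valid credentials to run.

```python
import uuid
from typing import Callable, Dict

# Hypothetical in-memory stand-ins for the DynamoDB table and the S3 bucket.
posts: Dict[str, dict] = {}
audio_store: Dict[str, bytes] = {}


def new_post(text: str, voice: str) -> str:
    """First method: record a new post that still has to be converted."""
    post_id = str(uuid.uuid4())
    posts[post_id] = {
        "id": post_id,
        "text": text,
        "voice": voice,
        "status": "PROCESSING",  # assumed status name
        "url": None,
    }
    return post_id


def convert_to_audio(post_id: str, synthesize: Callable[[str, str], bytes]) -> None:
    """Conversion step: turn the stored text into MP3 bytes and store them."""
    post = posts[post_id]
    audio = synthesize(post["text"], post["voice"])
    key = f"{post_id}.mp3"
    audio_store[key] = audio           # a real deployment would upload to S3
    post["status"] = "UPDATED"         # assumed status name
    post["url"] = f"memory://{key}"    # placeholder for the S3 link


def get_post(post_id: str) -> dict:
    """Second method: return information about a post, including its audio link."""
    return dict(posts[post_id])


def polly_synthesize(text: str, voice: str) -> bytes:
    """Synthesize speech with Amazon Polly (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId=voice)
    return response["AudioStream"].read()
```

Passing the synthesizer in as a parameter keeps the sketch testable offline: convert_to_audio(post_id, polly_synthesize) would call Polly, while any function with the same signature can stand in for it locally.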
The possibilities include good old-fashioned cassette tape recorders, specialised talking book readers such as the Victor Reader Stream, CD players, MP3 players, smartphones, tablets and PCs. This includes a 7th-generation Kindle ebook reader, a small external Kindle Audio Adapter, and VoiceView for Kindle software. There are also TTS apps for smartphones and tablets, including Voice Dream Reader for Apple and Android. For example, the new Victor Reader Stream plays Audible books while also including Acapela's TTS software, which can voice text files and ebooks in the ePub format.
Panasonic's New Smart TVs Now Listen and Speak with Nuance's Dragon TV
Panasonic's New SMART VIERA HDTVs Voice Interaction Lets People Find TV Content, Search the Web, Get Access to Apps and More with the Power of Dragon
Now people can simply sit back and speak to find content, search the web, control volume and more – creating a more interactive and intelligent television experience. Dragon TV brings the personal assistant experience to the living room, where consumers are able to speak to their TV and it responds, whether changing the channel, finding favorite movies and content by program name or actor, or staying connected with social media or the web – and much more. Panasonic SMART VIERA TVs featuring Dragon TV will be available worldwide beginning in the Spring of 2013. Dragon TV is a part of Nuance's portfolio of voice, touch and natural language understanding innovations that are defining a new generation of intelligent systems and personal assistant technologies, which also includes Dragon NaturallySpeaking, Dragon Dictate for Mac, Dragon Assistant for Intel-inspired Ultrabooks, Dragon Dictation, Dragon Go!, Dragon Drive!, Dragon ID, Dragon Voicemail to Text, and Swype.