speech synthesis



Voice artificial intelligence is an emerging technology that lets people interact with machines through spoken commands. The field is growing rapidly and is the subject of intense engineering research into untapped applications. We are already well accustomed to hearing AI voices narrate articles and reports in a monotone. The most familiar examples are Alexa- and Siri-enabled devices, which are gaining significant recognition as the market for similar products grows quickly.

Raising Robovoices

Communications of the ACM

In a critical episode of The Mandalorian, a TV series set in the Star Wars universe, a mysterious Jedi fights his way through a horde of evil robots. As the heroes of the show wait anxiously to learn the identity of their cloaked savior, he lowers his hood, and--spoiler alert--they meet a young Luke Skywalker. Actually, what we see is an animated, de-aged version of the Jedi. Then Luke speaks, in a voice that sounds very much like the 1980s-era rendition of the character, thanks to the use of an advanced machine learning model developed by the voice technology startup Respeecher. "No one noticed that it was generated by a machine," says Dmytro Bielievtsov, chief technology officer at Respeecher.

KT restores voices of Lou Gehrig's disease patients with artificial intelligence


KT, a major telecom company in South Korea, has restored the voices of eight patients with Lou Gehrig's disease using artificial intelligence-based speech synthesis technology. Patients can communicate with their friends and family members using a smartphone app that converts written text into their voices within a second. Lou Gehrig's disease, also known as amyotrophic lateral sclerosis (ALS), is a neurodegenerative disease that damages nerve cells in the brain and spinal cord. It can eventually affect patients' movements, including walking and swallowing, due to the loss of voluntary muscle control. Patients can lose their voices as the muscles of their tongues weaken.

Make your content available in over 128 languages and dialects for only $37


The internet has made the world ever smaller since its inception, but many new platforms have sprung to life in just the last decade or so. Between podcasts, webinars, and new forms of social media, you can now reach more people in a variety of mediums. And now you can offer your content in multiple languages to extend that reach even further with a lifetime subscription to TexTalky, which is on sale at the moment for only $37. Cloud-based TexTalky is a natural human voice text-to-speech synthesizer powered by artificial intelligence. The platform is all online, so you don't have to install any apps to use it. You can even quickly share your files by exporting them in WAV, MP3 and OGG formats.

Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System


In this paper, we study the controllability of an Expressive TTS system trained on a dataset for continuous control. The dataset is the Blizzard 2013 dataset, based on audiobooks read by a female speaker and containing great variability in style and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that of a reference utterance.
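
The objective assessment described in this abstract can be illustrated with a small sketch. The abstract only says "a measure of correlation", so the use of Pearson correlation here is an assumption, as are the function names, the feature name `mean_f0`, and the toy numbers. The idea is to compute, for each acoustic feature, its correlation with each latent dimension across a set of utterances.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_table(latents, features):
    """Correlate each acoustic feature with each latent dimension.

    latents:  one latent vector per utterance (hypothetical expressiveness codes)
    features: dict mapping a feature name to one value per utterance
    Returns a dict: feature name -> list of correlations, one per latent dimension.
    """
    dims = len(latents[0])
    return {
        name: [pearson([z[d] for z in latents], values) for d in range(dims)]
        for name, values in features.items()
    }


# Toy data (made up for illustration): four utterances, two latent dimensions.
latents = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
features = {"mean_f0": [100.0, 110.0, 120.0, 130.0]}
table = correlation_table(latents, features)
```

In this toy example the first latent dimension tracks `mean_f0` perfectly while the second does not. A latent dimension that correlates strongly with an acoustic feature is one a user could plausibly steer to control that feature.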

Listen to an AI voice actor try and flirt with you


The quality of AI-generated voices has improved rapidly in recent years, but there are still aspects of human speech that escape synthetic imitation. Sure, AI actors can deliver smooth corporate voiceovers for presentations and adverts, but more complex performances -- a convincing rendition of Hamlet, for example -- remain out of reach. Sonantic, an AI voice startup, says it's made a minor breakthrough in its development of audio deepfakes, creating a synthetic voice that can express subtleties like teasing and flirtation. The company says the key to its advance is the incorporation of non-speech sounds into its audio; training its AI models to recreate those small intakes of breath -- tiny scoffs and half-hidden chuckles -- that give real speech its stamp of biological authenticity. "We chose love as a general theme," Sonantic co-founder and CTO John Flynn tells The Verge.

Unsupervised word-level prosody tagging for controllable speech synthesis

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel approach for unsupervised word-level prosody tagging with two stages, where we first group the words into different types with a decision tree according to their phonetic content and then cluster the prosodies using GMM within each type of words separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
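
The two-stage idea in this abstract can be sketched in a few lines. This is not the authors' implementation. The hand-written phoneme-count rule below stands in for their decision tree, the prosody of each word is reduced to a single hypothetical scalar, and the one-dimensional GMM is fitted with a plain EM loop. All names and numbers are assumptions for illustration.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    """Stage 1 stand-in: a fixed split on phoneme count instead of a learned decision tree."""
    return "short" if len(phonemes) <= 3 else "long"


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(values, k=2, iters=50, min_var=1e-3):
    """Stage 2: fit a k-component 1-D GMM with EM; means start evenly spaced from min to max."""
    lo, hi = min(values), max(values)
    means = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    variances = [max(((hi - lo) / k) ** 2, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weight, mean and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def predict(x, gmm):
    """Index of the component with the highest weighted density at x."""
    weights, means, variances = gmm
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


def tag_words(words, k=2):
    """words: list of (text, phonemes, prosody_value). Returns one tag per word, e.g. 'short_0'."""
    groups = defaultdict(list)
    for _, phonemes, prosody in words:
        groups[word_type(phonemes)].append(prosody)
    gmms = {t: fit_gmm_1d(vals, k) for t, vals in groups.items()}
    return [f"{word_type(ph)}_{predict(p, gmms[word_type(ph)])}" for _, ph, p in words]


# Toy data: the prosody values are made-up scalars, not real measurements.
words = [
    ("a", ["AH"], 0.1), ("the", ["DH", "AH"], 0.2),
    ("it", ["IH", "T"], 1.9), ("on", ["AA", "N"], 2.0),
    ("synthesis", ["S", "IH", "N", "TH", "AH", "S", "AH", "S"], 5.0),
    ("prosody", ["P", "R", "AA", "S", "AH", "D", "IY"], 5.2),
    ("expressive", ["IH", "K", "S", "P", "R", "EH", "S", "IH", "V"], 9.0),
    ("manually", ["M", "AE", "N", "Y", "UW", "AH", "L", "IY"], 9.1),
]
tags = tag_words(words)
```

Because each word type gets its own mixture, the tag sets are separate: a tag such as `short_1` and a tag such as `long_1` label different prosody clusters, which matches the abstract's assumption that long and short words should not share a label set.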

Windows 11 update brings two new natural voices: 'Jenny' and 'Aria'


Microsoft continues to focus on accessibility in new Windows 11 features and has now introduced two new, more natural-sounding voices called Jenny and Aria. The new voices are part of Microsoft's accessibility features for Windows 11's built-in Narrator screen-reading app, which can be used to read out text from websites, email and documents. The feature relies on Microsoft's work on neural text-to-speech synthesis for on-device processing. The two new voices, Jenny and Aria, will be offered when Narrator is first launched, which can be done by pressing Windows key + Ctrl + Enter. Narrator can also be configured to launch automatically at startup.