Speech Synthesis

KT restores voices of Lou Gehrig's disease patients with artificial intelligence


KT, a major telecom company in South Korea, has restored the voices of eight patients with Lou Gehrig's disease using artificial intelligence-based speech synthesis technology. Patients can communicate with friends and family through a smartphone app that converts written text into their own voices in under a second. Lou Gehrig's disease, also known as amyotrophic lateral sclerosis (ALS), is a neurodegenerative disease that damages nerve cells in the brain and spinal cord. As voluntary muscle control is lost, it eventually affects patients' movements, including walking and swallowing, and patients can lose their voices as the muscles of the tongue weaken.

A lifetime subscription to this intuitive text-to-speech software is on sale for under £30


TL;DR: A lifetime subscription to TexTalky AI Text-to-Speech is on sale for £28.08, 93% off the list price. From marketing content and video narration to customer support and tutorials, there are many instances in today's marketplace when a professional human voice is needed. But due to time constraints, a lack of proper recording equipment, or simply the fact that you hate your own voice, you may turn to text-to-speech software. The robotic voices from these apps sometimes leave a lot to be desired. TexTalky AI Text-to-Speech aims to convert your text to lifelike human voices in just a few seconds.

Make your content available in over 128 languages and dialects for only $37


The internet has made the world ever smaller since its inception, and many new platforms have sprung to life in just the last decade or so. Between podcasts, webinars, and new forms of social media, you can now reach more people across a variety of mediums. And now you can offer your content in multiple languages to extend that reach even further with a lifetime subscription to TexTalky, currently on sale for only $37. Cloud-based TexTalky is a natural human voice text-to-speech synthesizer powered by artificial intelligence. The platform runs entirely online, so you don't have to install any apps to use it. You can even quickly share your files by exporting them in WAV, MP3, and OGG formats.

Listen to an AI voice actor try and flirt with you


The quality of AI-generated voices has improved rapidly in recent years, but there are still aspects of human speech that escape synthetic imitation. Sure, AI actors can deliver smooth corporate voiceovers for presentations and adverts, but more complex performances -- a convincing rendition of Hamlet, for example -- remain out of reach. Sonantic, an AI voice startup, says it's made a minor breakthrough in its development of audio deepfakes, creating a synthetic voice that can express subtleties like teasing and flirtation. The company says the key to its advance is the incorporation of non-speech sounds into its audio; training its AI models to recreate those small intakes of breath -- tiny scoffs and half-hidden chuckles -- that give real speech its stamp of biological authenticity. "We chose love as a general theme," Sonantic co-founder and CTO John Flynn tells The Verge.

Unsupervised word-level prosody tagging for controllable speech synthesis

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies using a GMM within each type of word separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
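The two-stage tagging described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word-length bucketing stands in for their phonetic decision tree, and the function name, prosody feature layout, and tag count are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def tag_word_prosody(words, prosody_feats, n_tags=4, seed=0):
    """Assign an unsupervised prosody tag to each word.

    words: list of word strings
    prosody_feats: (N, D) array of per-word prosody features
                   (e.g. mean pitch, energy, duration)
    Returns an integer tag in [0, n_tags) per word, clustered
    separately within each word-type group.
    """
    # Stage 1: group words by a crude proxy for phonetic content
    # (word length), in place of the paper's phonetic decision tree.
    groups = np.array([min(len(w), 6) // 3 for w in words])  # short/medium/long
    tags = np.zeros(len(words), dtype=int)
    # Stage 2: fit a separate GMM over prosody features in each group.
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        k = min(n_tags, len(idx))  # can't have more components than samples
        gmm = GaussianMixture(n_components=k, random_state=seed)
        tags[idx] = gmm.fit(prosody_feats[idx]).predict(prosody_feats[idx])
    return tags
```

In the paper these tags would then condition the TTS model during training, so that picking a tag at synthesis time manipulates the word's prosody.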

This scanner pen turns text to speech, translates words, and more


TL;DR: As of Feb. 11, you can slash 37% off the NEWYES Scan Reader Pen 3 Text-to-Speech OCR Multilingual Translator and get it for $124.99 instead of $199. If you are studying a second language, taking lots of notes for work or school, struggling with written text, or just looking for an easier way to get through the stack of books on your nightstand, there are tools that can help you out. One that's making its mark -- and happens to be on sale -- is the NEWYES Scan Reader Pen 3. The pen opens up new possibilities for learning. You can use it to read and retain information, translate words and phrases, look up words on the spot, capture quotes and transfer them to your computer, or even record audio to review later. This text-to-speech reader pen recognizes 3,000 characters per minute and translates in 0.3 seconds with 98 percent accuracy.

Windows 11 update brings two new natural voices: 'Jenny' and 'Aria'


Microsoft continues to focus on accessibility in new Windows 11 features and has now introduced two new, more natural-sounding voices called Jenny and Aria. The new voices are part of Microsoft's accessibility features for Windows 11's built-in Narrator screen-reading app, which can be used to read out text from websites, email, and documents. The feature relies on Microsoft's work on neural text-to-speech synthesis for on-device processing. The two new voices will be offered when Narrator is first launched, which can be done by pressing Windows key + Ctrl + Enter. Narrator can also be configured to launch automatically at startup.

The MSXF TTS System for ICASSP 2022 ADD Challenge

This paper presents our MSXF TTS system for Task 3.1 of the Audio Deep Synthesis Detection (ADD) Challenge 2022. We use an end-to-end text-to-speech system and add a constraint loss during the training stage. The end-to-end TTS system is VITS, and the pre-trained self-supervised model is wav2vec 2.0. We also explore the influence of speech speed and volume on spoofing: faster speech means less silence in the audio, which makes it easier to fool the detector. We also find that the smaller the volume, the better the spoofing ability, though we normalize volume for submission. Our team is identified as C2, and we placed fourth in the challenge.
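The volume normalization mentioned above can be done in several ways; the abstract does not say which the authors used. A simple RMS-based sketch (the function name and the -23 dBFS default target are assumptions for illustration):

```python
import numpy as np

def rms_normalize(wav, target_dbfs=-23.0, eps=1e-9):
    """Scale a float waveform in [-1, 1] so its RMS level
    matches a target level in dBFS."""
    rms = np.sqrt(np.mean(wav ** 2)) + eps          # current RMS amplitude
    target_rms = 10.0 ** (target_dbfs / 20.0)       # dBFS -> linear amplitude
    return wav * (target_rms / rms)
```

Normalizing every submitted clip to a common level like this removes volume as a confounding factor when spoofing ability is evaluated.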