
A Context-Based Numerical Format Prediction for a Text-To-Speech System

Darwesh, Yaser, Wern, Lit Wei, Mustafa, Mumtaz Begum

arXiv.org Artificial Intelligence

Many existing TTS systems cannot accurately synthesize text containing a variety of numerical formats, resulting in reduced intelligibility of the synthesized speech. This research aims to develop a numerical format classifier that can classify six types of numeric contexts. Experiments were carried out using the proposed context-based feature extraction technique, which focuses on extracting keywords, punctuation marks, and symbols as the features of the numbers. Support Vector Machine, K-Nearest Neighbors, Linear Discriminant Analysis, and Decision Tree were used as classifiers. We used the 10-fold cross-validation technique to determine the classification accuracy in terms of recall and precision. We found that the proposed solution outperforms the existing feature extraction technique, improving classification accuracy by 30% to 37%. Number format classification can thus increase the intelligibility of TTS systems.
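The pipeline the abstract describes can be sketched with scikit-learn: extract the words, punctuation, and symbols surrounding each number as features, then score a classifier with 10-fold cross-validation. The six labels and the example sentences below are illustrative stand-ins, not the paper's actual dataset or feature set.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy labeled examples: the keywords, punctuation, and symbols around a
# number hint at its format (labels and sentences are illustrative).
samples = [
    ("call me at 555-0199", "phone"),
    ("the meeting is at 10:30", "time"),
    ("it costs $25", "currency"),
    ("born on 12/05/1990", "date"),
    ("room 42 is free", "cardinal"),
    ("she finished 1st", "ordinal"),
] * 10  # repeat so 10-fold stratified CV has 10 samples per class

texts, labels = zip(*samples)

def mask_digits(text):
    # Replace digit runs with a placeholder so the classifier learns
    # from the surrounding context, not the digits themselves.
    return re.sub(r"\d+", " NUM ", text)

vectorizer = CountVectorizer(
    preprocessor=mask_digits,
    token_pattern=r"[^\s]+",  # keep symbols like $ : / - as tokens
)
X = vectorizer.fit_transform(texts)

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, labels, cv=10)
print(round(scores.mean(), 2))
```

Because the toy examples are repeated, the cross-validation score here is near-perfect; on real text the separation between formats would of course be harder.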


MunTTS: A Text-to-Speech System for Mundari

Gumma, Varun, Hada, Rishav, Yadavalli, Aditya, Gogoi, Pamir, Mondal, Ishani, Seshadri, Vivek, Bali, Kalika

arXiv.org Artificial Intelligence

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and train end-to-end speech models. We also delve into the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.


I Cloned My Voice and My Mother Couldn't Tell the Difference

Slate

This article is from Understanding AI, a newsletter that explores how A.I. works and how it's changing our world. A couple of weeks ago, I used A.I. software to clone my voice. The resulting audio sounded pretty convincing to me, but I wanted to see what others thought. So I created a test audio file based on the first 12 paragraphs of this article that I wrote. Seven randomly chosen paragraphs were my real voice, while the other five were generated by A.I. I asked members of my family to see if they could tell the difference.


Deep Learning Based Assessment of Synthetic Speech Naturalness

Mittag, Gabriel, Möller, Sebastian

arXiv.org Artificial Intelligence

In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works language-independently. The model is trained end-to-end and based on a CNN-LSTM network that has previously been shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.


DeepMind and Google recreate former NFL linebacker Tim Shaw's voice using AI

#artificialintelligence

In August, Google AI researchers working with the ALS Therapy Development Institute shared details about Project Euphonia, a speech-to-text transcription service for people with speaking impairments. They showed that, using datasets of audio from both native and non-native English speakers with neurodegenerative diseases and techniques from Parrotron, an AI tool for people with speech impediments, they could drastically improve the quality of speech synthesis and generation. Recently, in something of a case study, Google researchers and a team from Alphabet's DeepMind employed Euphonia in an effort to recreate the original voice of Tim Shaw, a former NFL linebacker who played for the Carolina Panthers, Jacksonville Jaguars, Chicago Bears, and Tennessee Titans before retiring in 2013. Roughly six years ago, Shaw was diagnosed with ALS, which has confined him to a wheelchair and left him unable to speak, swallow, or breathe without assistance. Over the course of six months, the joint research team adapted a generative AI model -- WaveNet -- to the task of synthesizing speech from samples of Shaw's voice recorded prior to his ALS diagnosis.


Amazon's voice-synthesizing AI mimics shifts in tempo, pitch, and volume

#artificialintelligence

Voice assistants like Alexa convert written words into speech using text-to-speech systems, the most capable of which tap AI to verbalize from scratch rather than stringing together prerecorded snippets of sounds. Neural text-to-speech systems, or NTTS, tend to produce more natural-sounding speech than conventional models, but arguably their real value lies in their adaptability, as they're able to mimic the prosody of a recording, or its shifts in tempo, pitch, and volume. In a paper ("Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-to-Speech") presented at this year's Interspeech conference in Graz, Austria, Amazon scientists investigated prosody transfer with a system that enabled them to choose voices in recordings while preserving the original inflections. They say it significantly improved on past attempts, which generally haven't adapted well to input voices they haven't encountered before. To this end, the team's system leveraged prosodic features that are easier to normalize than the raw spectrograms (representations of changes in signal frequency over time) typically ingested by neural text-to-speech networks.
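The core idea, prosodic features being easier to normalize than raw spectrograms, can be illustrated with a toy per-speaker standardization: z-score a pitch contour to strip away the source speaker's range, then rescale it into the target speaker's range. The F0 values below are synthetic placeholders, not Amazon's actual features or method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level pitch (F0, in Hz) for two speakers with
# different ranges: a higher-pitched reference and a lower-pitched target.
ref_f0 = rng.normal(loc=220.0, scale=30.0, size=200)
target_f0 = rng.normal(loc=120.0, scale=15.0, size=200)

def zscore(x):
    # Per-speaker standardization: removes the speaker-specific mean
    # and range so only the relative contour (the prosody) remains.
    return (x - x.mean()) / x.std()

def transfer_contour(contour, target_mean, target_std):
    # Map a normalized contour into the target speaker's pitch range,
    # preserving the original inflections in relative terms.
    return zscore(contour) * target_std + target_mean

converted = transfer_contour(ref_f0, target_f0.mean(), target_f0.std())
print(round(converted.mean(), 1), round(converted.std(), 1))
```

After the transfer, the converted contour sits in the target speaker's range while keeping the shape of the reference speaker's pitch movements, which is the intuition behind normalizing prosodic features rather than spectrograms.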


Transition of Siri's Voice From Robotic to Human: Note the Difference - DZone AI

#artificialintelligence

Being an iOS user, how many times do you talk to Siri in a day? If you are a keen observer, then you know that Siri's voice sounds much more like a human in iOS 11 than it has before. This is because Apple is digging deeper into the technology of artificial intelligence, machine learning, and deep learning to offer the best personal assistant experience to its users. From the introduction of Siri with the iPhone 4S to its continuation in iOS 11, this personal assistant has evolved to get closer to humans and establish good relations with them. To reply to voice commands of users, Siri uses speech synthesis combined with deep learning.


Baidu's text-to-speech system mimics a variety of accents 'perfectly'

Engadget

Chinese tech giant Baidu's text-to-speech system, Deep Voice, is making a lot of progress toward sounding more human. The latest news about the tech is a set of audio samples showcasing its ability to accurately portray differences in regional accents. The company says that the new version, aptly named Deep Voice 2, has been able to "learn from hundreds of unique voices from less than a half an hour of data per speaker, while achieving high audio quality." That's compared to the 20 hours of training it took to get similar results from the previous iteration, for a single voice, further pushing its efficiency past Google's WaveNet in a few months' time. Baidu says that unlike previous text-to-speech systems, Deep Voice 2 finds shared qualities between the training voices entirely on its own, without any prior guidance.


How AI researchers built a neural network that learns to speak in just a few hours

#artificialintelligence

Text-to-speech systems are familiar in the modern world in navigation apps, talking clocks, telephone answering systems, and so on. Traditionally these have been created by recording a large database of speech from a single individual and then recombining the utterances to make new phrases. The problem with these systems is that it is difficult to switch to a new speaker or change the emphasis in their words without recording an entirely new database. So computer scientists have been working on another approach. Their goal is to synthesize speech in real time from scratch as it is required.
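The traditional concatenative approach described above can be sketched as a lookup-and-splice over a database of prerecorded units. The word-level "clips" below are placeholder strings standing in for audio; real systems work with much smaller units such as diphones, but the limitation is the same: anything missing from the database cannot be spoken.

```python
# Hypothetical mini unit database: each word maps to a prerecorded
# clip (a placeholder string standing in for actual audio samples).
unit_db = {
    "hello": "[clip:hello]",
    "world": "[clip:world]",
    "good": "[clip:good]",
    "morning": "[clip:morning]",
}

def concatenative_tts(phrase):
    # Traditional approach: look up each unit and splice the recordings
    # together. A word absent from the database cannot be synthesized --
    # the rigidity that motivates generating speech from scratch instead.
    units = []
    for word in phrase.lower().split():
        if word not in unit_db:
            raise KeyError(f"no recorded unit for {word!r}")
        units.append(unit_db[word])
    return "".join(units)

print(concatenative_tts("Hello world"))
```

Swapping to a new speaker or changing emphasis means re-recording every entry in `unit_db`, which is exactly the cost that real-time synthesis from scratch avoids.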


Deep Voice: Real-Time Neural Text-to-Speech for Production - Baidu Research

#artificialintelligence

Baidu Research presents Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. The biggest obstacle to building such a system thus far has been the speed of audio synthesis -- previous approaches have taken minutes or hours to generate only a few seconds of speech. We solve this challenge and show that we can do audio synthesis in real time, which amounts to an up to 400X speedup over previous WaveNet inference implementations. Synthesizing artificial human speech from text, commonly known as text-to-speech (TTS), is an essential component in many applications such as speech-enabled devices, navigation systems, and accessibility for the visually impaired. Fundamentally, it allows human-technology interaction without requiring visual interfaces.