Speech Synthesis


Text to speech Python Tutorial

#artificialintelligence

We can make the computer speak with Python: given a text string, it will speak the written words aloud in English. This process is called Text-to-Speech (TTS). gTTS is a cross-platform text-to-speech wrapper for Python; it uses the Google Text-to-Speech (TTS) API.
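
A minimal sketch of the approach, assuming the gTTS package (pip install gTTS) and an internet connection, since the actual synthesis is performed by Google's API:

    # Minimal example: synthesize an English sentence to an MP3 file with gTTS.
    from gtts import gTTS

    text = "Hello, this sentence is spoken by the computer."
    tts = gTTS(text=text, lang="en")  # request English synthesis from the API
    tts.save("hello.mp3")             # write the returned audio to disk

For fully offline synthesis, pyttsx3 is a commonly used alternative that drives the operating system's own speech engines instead of a cloud API.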


Amazon's Text-To-Speech AI Service Sounds More Natural And Realistic

#artificialintelligence

Amazon has enhanced Polly, its cloud-based text-to-speech service, to deliver more natural and realistic speech synthesis. The service can now present domain-specific styles such as newscast and sportscast. Though text-to-speech has existed for more than two decades, it was never adopted by mainstream media because it lacked natural and realistic modulation. Except for automated announcements read out from existing data stores, the technology never replaced the human voice. Thanks to advances in AI, text-to-speech has become natural and realistic enough that it may be hard to distinguish from a human voice.
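
As a rough sketch of how the newscast style is requested, assuming boto3 with AWS credentials already configured and a neural voice that supports the style (the voice ID here is illustrative):

    # Sketch: requesting Polly's newscaster speaking style via SSML.
    import boto3

    polly = boto3.client("polly")
    ssml = (
        '<speak><amazon:domain name="news">'
        "Amazon Polly now supports a newscaster speaking style."
        "</amazon:domain></speak>"
    )
    response = polly.synthesize_speech(
        Engine="neural",      # the newscaster style requires a neural voice
        VoiceId="Matthew",    # illustrative; must be a voice supporting the style
        TextType="ssml",
        Text=ssml,
        OutputFormat="mp3",
    )
    with open("newscast.mp3", "wb") as f:
        f.write(response["AudioStream"].read())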


Converting Text to Speech with Azure Cognitive Service's REST-Based API

#artificialintelligence

Adam Bertram is a 20-year veteran of IT. Adam focuses on DevOps, system management, and automation technologies, as well as various cloud platforms. He is a Microsoft Cloud and Datacenter Management MVP and an efficiency nerd who enjoys teaching others a better way to leverage automation.
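
A minimal sketch of the REST-based call, assuming Python's requests library; the region, subscription key, voice name, and output format are placeholders to adapt to your own Speech resource:

    # Sketch: calling the Azure Cognitive Services text-to-speech REST endpoint.
    import requests

    region = "westus"                    # placeholder: your Speech resource region
    key = "YOUR_SUBSCRIPTION_KEY"        # placeholder: your Speech resource key
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"

    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        "<voice name='en-US-JennyNeural'>Hello from Azure text to speech.</voice>"
        "</speak>"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "User-Agent": "tts-sample",
    }
    resp = requests.post(url, data=ssml.encode("utf-8"), headers=headers)
    resp.raise_for_status()
    with open("azure_tts.mp3", "wb") as f:
        f.write(resp.content)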


RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

arXiv.org Machine Learning

We present RUSLAN, a new open Russian spoken language corpus for the text-to-speech task. RUSLAN contains 22,200 audio samples with text annotations, more than 31 hours of high-quality speech from a single person, making it the largest annotated Russian corpus in terms of speech duration for a single speaker. We trained an end-to-end neural network for the text-to-speech task on our corpus and evaluated the quality of the synthesized speech with a Mean Opinion Score (MOS) test. The synthesized speech achieves a score of 4.05 for naturalness and 3.78 for intelligibility on a 5-point MOS scale.
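
For context, a Mean Opinion Score is simply the average of listener ratings on a 1-to-5 scale, typically reported with a confidence interval; a toy computation (the ratings are invented for illustration):

    # Illustration: MOS is the mean of listener ratings on a 1-5 scale.
    import statistics

    ratings = [4, 5, 4, 3, 4, 5, 4, 4]        # hypothetical listener scores
    mos = statistics.mean(ratings)
    ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5  # normal approx.
    print(f"MOS = {mos:.2f} +/- {ci95:.2f}")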


r/MachineLearning - [R] Parallel Neural Text-to-Speech

#artificialintelligence

Abstract: In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about 17.5 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on the challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass.
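
To make "non-autoregressive" concrete: unlike an autoregressive decoder that emits one spectrogram frame conditioned on the previous ones, a fully convolutional model maps the whole text sequence to the whole spectrogram in a single forward pass. A toy PyTorch sketch, not the paper's architecture; all names, sizes, and layer counts are illustrative:

    # Toy sketch of a fully convolutional, non-autoregressive text-to-spectrogram
    # model: one forward pass produces every frame at once.
    import torch
    import torch.nn as nn

    class ParallelTTSSketch(nn.Module):
        def __init__(self, vocab_size=100, channels=256, n_mels=80):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, channels)
            self.convs = nn.Sequential(*[
                nn.Sequential(
                    nn.Conv1d(channels, channels, kernel_size=5, padding=2),
                    nn.ReLU(),
                )
                for _ in range(6)
            ])
            # Upsample text positions toward spectrogram frames, project to mels.
            self.upsample = nn.ConvTranspose1d(channels, channels,
                                               kernel_size=4, stride=4)
            self.to_mel = nn.Conv1d(channels, n_mels, kernel_size=1)

        def forward(self, tokens):                    # tokens: (batch, text_len)
            x = self.embed(tokens).transpose(1, 2)    # (batch, channels, text_len)
            x = self.convs(x)
            x = self.upsample(x)                      # (batch, channels, 4*text_len)
            return self.to_mel(x)                     # (batch, n_mels, frames)

    mel = ParallelTTSSketch()(torch.randint(0, 100, (1, 50)))
    print(mel.shape)  # torch.Size([1, 80, 200])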


Alexa speech normalization AI reduces errors by up to 81%

#artificialintelligence

Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon's Alexa, "Book me a table at 5:00 p.m." might be transcribed by the assistant's automatic speech recognizer as "five p m" and further reformatted to "5:00 PM." Conversely, Alexa might convert "5:00 PM" back to "five p m" for its text-to-speech synthesizer. So how does this work? Currently, Amazon's voice assistant relies on "thousands" of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu.
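
For a sense of what "handwritten normalization rules" look like in practice, here is a toy rule-based normalizer; the rules are invented for illustration, and a production system would need thousands of them:

    # Toy rule-based text normalizer: each entry is one handwritten rule.
    import re

    NUMS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six",
            7: "seven", 8: "eight", 9: "nine", 10: "ten", 11: "eleven",
            12: "twelve"}

    RULES = [
        # "5:00 PM" -> "five p m" (on-the-hour times only, for brevity)
        (re.compile(r"\b(\d{1,2}):00\s*PM\b", re.I),
         lambda m: f"{NUMS[int(m.group(1))]} p m"),
        # "Dr. " -> "doctor "
        (re.compile(r"\bDr\.\s"), lambda m: "doctor "),
    ]

    def normalize(text):
        # Apply each handwritten rule in turn.
        for pattern, repl in RULES:
            text = pattern.sub(repl, text)
        return text

    print(normalize("Book me a table at 5:00 PM with Dr. Smith."))
    # -> Book me a table at five p m with doctor Smith.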


WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

arXiv.org Machine Learning

WaveCycleGAN was recently proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis; it provides fast inference with a moving-average model rather than an autoregressive one, and high-quality synthesis through adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones. One possible cause of this distinguishability is the aliasing introduced into the processed waveform by down/up-sampling modules. To remove the aliasing and provide higher-quality speech synthesis, we propose WaveCycleGAN2, which 1) uses generators without down/up-sampling modules and 2) combines discriminators in the waveform domain and the acoustic-parameter domain. The results show that the proposed method 1) alleviates the aliasing well, 2) is useful for speech waveforms generated both by analysis-and-synthesis and by statistical parametric speech synthesis, and 3) achieves a mean opinion score comparable to those of natural speech and of speech synthesized by WaveNet (open WaveNet) and WaveGlow, while processing speech samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
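
To see why naive down/up-sampling causes audible artifacts, consider what happens to a tone above the reduced Nyquist frequency; a small numpy demonstration (unrelated to the paper's code):

    # Demonstrating aliasing: decimating without an anti-alias filter folds a
    # 5 kHz tone down to a spurious 1 kHz component.
    import numpy as np

    sr = 16000                           # original sample rate (Hz)
    t = np.arange(sr) / sr               # one second of timestamps
    x = np.sin(2 * np.pi * 5000 * t)     # 5 kHz tone

    down = x[::4]                        # naive 4x decimation -> 2 kHz Nyquist
    freqs = np.fft.rfftfreq(len(down), d=4 / sr)
    spectrum = np.abs(np.fft.rfft(down))
    print(f"peak after decimation: {freqs[spectrum.argmax()]:.0f} Hz")
    # The 5 kHz tone folds to about 1000 Hz (|5000 - 4000| Hz): aliasing.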


OrCam - Advanced Wearable AI Devices for the Blind Closing The Gap

#artificialintelligence

The most advanced wearable assistive technology device for the blind and visually impaired: it reads text, recognizes faces, identifies products, and more. It responds intuitively to simple hand gestures, and real-time face identifications are announced seamlessly. The device is small, lightweight, and magnetically mounts onto virtually any eyeglass frame; it is wireless and does not require an internet connection.


Unsupervised Polyglot Text To Speech

arXiv.org Machine Learning

We present a TTS neural network that is able to produce speech in multiple languages. The proposed network can transfer a voice, presented as a sample in a source language, into any of several target languages. The conversion is based on learning a polyglot network with multiple per-language sub-networks and on adding loss terms that preserve the speaker's identity across languages. We evaluate the proposed polyglot neural network on three languages with a total of more than 400 speakers and demonstrate convincing conversion capabilities. Neural text-to-speech (TTS) is an emerging technology that is becoming dominant over alternative TTS technologies in both quality and flexibility.
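
To make the "per-language sub-networks plus a shared speaker identity" idea concrete, here is a toy sketch, not the paper's network; the module choices and sizes are invented for illustration:

    # Toy polyglot TTS sketch: a shared speaker embedding conditions one
    # decoder sub-network per target language.
    import torch
    import torch.nn as nn

    class PolyglotTTSSketch(nn.Module):
        def __init__(self, vocab=100, dim=128, n_mels=80,
                     languages=("en", "de", "es"), n_speakers=400):
            super().__init__()
            self.speaker_embed = nn.Embedding(n_speakers, dim)  # one vector per speaker
            self.text_embed = nn.Embedding(vocab, dim)
            # One per-language decoder sub-network.
            self.decoders = nn.ModuleDict({
                lang: nn.GRU(dim, dim, batch_first=True) for lang in languages
            })
            self.to_mel = nn.Linear(dim, n_mels)

        def forward(self, tokens, speaker_id, lang):
            # Condition every text position on the speaker's identity vector.
            x = self.text_embed(tokens) + self.speaker_embed(speaker_id).unsqueeze(1)
            h, _ = self.decoders[lang](x)
            return self.to_mel(h)        # (batch, text_len, n_mels)

    model = PolyglotTTSSketch()
    mel = model(torch.randint(0, 100, (1, 20)), torch.tensor([7]), "de")
    print(mel.shape)  # torch.Size([1, 20, 80])

In the paper, additional loss terms preserve the speaker's identity across languages; in a sketch like this, that would amount to penalizing drift in a speaker representation recovered from the synthesized speech.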