Speech Synthesis


r/MachineLearning - [R] Parallel Neural Text-to-Speech

#artificialintelligence

Abstract: In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about 17.5 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on the challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass.
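
The key contrast in the abstract is between an autoregressive decoder, which generates spectrogram frames one step at a time, and the proposed non-autoregressive model, which emits all frames in a single feed-forward pass. The sketch below illustrates only that difference; it is not the paper's ParaNet architecture, and the layer sizes, kernel widths, and module names are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a feed-forward decoder that predicts
# all spectrogram frames in one pass, versus an autoregressive decoder that
# must run one sequential step per frame at synthesis time.
import torch
import torch.nn as nn

TEXT_DIM, MEL_DIM, HIDDEN = 256, 80, 256   # illustrative sizes

class ParallelDecoder(nn.Module):
    """Fully convolutional decoder: every frame is predicted in a single pass."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(TEXT_DIM, HIDDEN, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(HIDDEN, MEL_DIM, kernel_size=5, padding=2),
        )

    def forward(self, text_enc):            # text_enc: (B, TEXT_DIM, T)
        return self.net(text_enc)           # (B, MEL_DIM, T), one forward pass

class AutoregressiveDecoder(nn.Module):
    """Frame-by-frame decoder: each step is conditioned on the previous frame."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(TEXT_DIM + MEL_DIM, HIDDEN)
        self.out = nn.Linear(HIDDEN, MEL_DIM)

    def forward(self, text_enc):            # text_enc: (B, TEXT_DIM, T)
        B, _, T = text_enc.shape
        h = text_enc.new_zeros(B, HIDDEN)
        prev = text_enc.new_zeros(B, MEL_DIM)
        frames = []
        for t in range(T):                  # T sequential steps at synthesis
            h = self.rnn(torch.cat([text_enc[:, :, t], prev], dim=-1), h)
            prev = self.out(h)
            frames.append(prev)
        return torch.stack(frames, dim=-1)  # (B, MEL_DIM, T)

text_enc = torch.randn(1, TEXT_DIM, 100)
mel_parallel = ParallelDecoder()(text_enc)               # single feed-forward pass
mel_autoregressive = AutoregressiveDecoder()(text_enc)   # 100 sequential steps
```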


Alexa speech normalization AI reduces errors by up to 81%

#artificialintelligence

Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon's Alexa, "Book me a table at 5:00 p.m." might be transcribed by the assistant's automatic speech recognizer as "five p m" and further reformatted to "5:00PM." Going the other direction, Alexa might convert "5:00PM" back to "five p m" for its text-to-speech synthesizer. So how does this work? Currently, Amazon's voice assistant relies on "thousands" of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu.
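
As a rough illustration of what one such handwritten rule can look like (a toy example, not Amazon's actual rule set), the snippet below uses a regular expression to map a spoken-form time like "five p m" to a written form like "5:00PM":

```python
# Toy normalization rule: rewrite spoken-form times ("five p m") into a
# written form ("5:00PM"). Real systems chain thousands of such rules.
import re

WORDS_TO_DIGITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
TIME_PATTERN = re.compile(
    r"\b(" + "|".join(WORDS_TO_DIGITS) + r")\s+([ap])\s*m\b", re.IGNORECASE
)

def spoken_time_to_written(text: str) -> str:
    def repl(match: re.Match) -> str:
        hour = WORDS_TO_DIGITS[match.group(1).lower()]
        return f"{hour}:00{match.group(2).upper()}M"
    return TIME_PATTERN.sub(repl, text)

print(spoken_time_to_written("book me a table at five p m"))
# -> book me a table at 5:00PM
```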


WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

arXiv.org Machine Learning

WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis; it provides fast inference with a moving-average model rather than an autoregressive model, and high-quality speech synthesis through adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones. One possible cause of this distinguishability is the aliasing observed in the processed speech waveform via down/up-sampling modules. To solve the aliasing and provide higher-quality speech synthesis, we propose WaveCycleGAN2, which 1) uses generators without down/up-sampling modules and 2) combines discriminators of the waveform domain and acoustic parameter domain. The results show that the proposed method 1) alleviates the aliasing well, 2) is useful for both speech waveforms generated by analysis-and-synthesis and statistical parametric speech synthesis, and 3) achieves a mean opinion score comparable to those of natural speech and speech synthesized by WaveNet (open WaveNet) and WaveGlow, while processing speech samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
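
To make the second contribution concrete, here is a hedged sketch of combining a waveform-domain discriminator with an acoustic-parameter-domain discriminator in a least-squares adversarial generator loss; the small networks and the STFT magnitude used as the "acoustic parameters" are illustrative stand-ins, not the paper's architecture.

```python
# Sketch: sum an adversarial loss from a raw-waveform discriminator and one
# from a spectral-feature discriminator (least-squares GAN formulation).
import torch
import torch.nn as nn

class WaveDiscriminator(nn.Module):
    """Judges raw waveforms directly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, 15, stride=4, padding=7),
        )
    def forward(self, wav):                 # wav: (B, 1, T)
        return self.net(wav)

class SpecDiscriminator(nn.Module):
    """Judges spectral features extracted from the waveform."""
    def __init__(self, n_bins=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bins, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 3, padding=1),
        )
    def forward(self, wav):                 # wav: (B, 1, T)
        spec = torch.stft(wav.squeeze(1), n_fft=1024, hop_length=256,
                          return_complex=True).abs()   # (B, 513, frames)
        return self.net(spec)

def generator_adversarial_loss(fake_wav, d_wave, d_spec):
    """Least-squares adversarial loss summed over both domains."""
    return (((d_wave(fake_wav) - 1) ** 2).mean()
            + ((d_spec(fake_wav) - 1) ** 2).mean())

fake_wav = torch.randn(2, 1, 16384)          # pretend generator output
loss = generator_adversarial_loss(fake_wav, WaveDiscriminator(), SpecDiscriminator())
```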


OrCam - Advanced Wearable AI Devices for the Blind Closing The Gap

#artificialintelligence

The most advanced wearable assistive technology device for the blind and visually impaired, which reads text, recognizes faces, identifies products, and more. It intuitively responds to simple hand gestures and seamlessly announces real-time identification of faces. Small, lightweight, and wireless, it magnetically mounts onto virtually any eyeglass frame and does not require an internet connection.


Unsupervised Polyglot Text To Speech

arXiv.org Machine Learning

Abstract: We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. The conversion is based on learning a polyglot network that has multiple per-language sub-networks and adding loss terms that preserve the speaker's identity in multiple languages. We evaluate the proposed polyglot neural network for three languages with a total of more than 400 speakers and demonstrate convincing conversion capabilities. Index terms: TTS, multilingual, unsupervised learning. From the introduction: Neural text to speech (TTS) is an emerging technology that is becoming dominant over alternative TTS technologies in both quality and flexibility.
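
A minimal sketch of the idea in the abstract follows, assuming simple per-language decoder sub-networks, a shared speaker embedding, and an added loss term that keeps the speaker embedding of the generated speech close to the source speaker's; the modules, dimensions, and the pooling-based speaker encoder are illustrative, not the paper's network.

```python
# Sketch: a polyglot model with one sub-network per target language plus a
# speaker-identity-preservation loss on the generated spectrograms.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_LANGUAGES, SPK_DIM, TEXT_DIM, MEL_DIM = 3, 64, 128, 80   # illustrative

class PolyglotTTS(nn.Module):
    def __init__(self):
        super().__init__()
        # one decoder sub-network per target language, all fed the same
        # speaker embedding so the voice can be carried across languages
        self.per_language = nn.ModuleList(
            nn.Sequential(nn.Linear(TEXT_DIM + SPK_DIM, 256), nn.ReLU(),
                          nn.Linear(256, MEL_DIM))
            for _ in range(N_LANGUAGES)
        )

    def forward(self, text_enc, spk_emb, lang_id):
        # text_enc: (B, T, TEXT_DIM), spk_emb: (B, SPK_DIM)
        spk = spk_emb.unsqueeze(1).expand(-1, text_enc.size(1), -1)
        return self.per_language[lang_id](torch.cat([text_enc, spk], dim=-1))

# hypothetical speaker encoder: project each frame and mean-pool over time
frame_proj = nn.Linear(MEL_DIM, SPK_DIM)
def speaker_embedding(mel):                  # mel: (B, T, MEL_DIM)
    return frame_proj(mel).mean(dim=1)       # (B, SPK_DIM)

def speaker_preservation_loss(mel, spk_emb):
    """Extra loss term: generated speech should keep the source speaker identity."""
    return F.mse_loss(speaker_embedding(mel), spk_emb)

model = PolyglotTTS()
text_enc, spk_emb = torch.randn(2, 50, TEXT_DIM), torch.randn(2, SPK_DIM)
mel = model(text_enc, spk_emb, lang_id=1)    # synthesize in target language 1
loss = speaker_preservation_loss(mel, spk_emb)
```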


Brain implants, AI, and a speech synthesizer have turned brain activity into robot words

#artificialintelligence

Neural networks have been used to turn words that a human has heard into intelligible, recognizable speech. It could be a step toward technology that can one day decode people's thoughts. The challenge: thanks to fMRI scanning, we've known for decades that when people speak, or hear others speak, specific parts of their brain are activated. However, it has proved hugely challenging to translate thoughts into words. A team from Columbia University has developed a system that combines deep learning with a speech synthesizer to do just that.
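
Conceptually, systems like this follow a two-stage recipe: regress acoustic features from the recorded neural activity, then hand the predicted features to a vocoder or speech synthesizer. The sketch below illustrates only that recipe, under assumed feature shapes and an assumed Griffin-Lim vocoder; it is not the Columbia team's actual model.

```python
# Highly simplified sketch: neural recordings -> predicted spectrogram frames
# -> waveform via a vocoder. All shapes and modules are assumptions.
import torch
import torch.nn as nn
import torchaudio

NEURAL_CH, MEL_DIM, LIN_BINS = 128, 80, 513   # assumed dimensions

decoder = nn.Sequential(                      # brain activity -> mel frames
    nn.Linear(NEURAL_CH, 256), nn.ReLU(),
    nn.Linear(256, MEL_DIM),
)
to_linear = nn.Linear(MEL_DIM, LIN_BINS)      # mel -> linear-spectrogram stand-in

neural = torch.randn(1, 200, NEURAL_CH)       # 200 frames of neural features
mel = decoder(neural)                         # (1, 200, MEL_DIM)

# Vocoder stage: invert a (made-up) magnitude spectrogram with Griffin-Lim.
# Real systems would use a learned vocoder conditioned on the predicted features.
lin = torch.relu(to_linear(mel)).transpose(1, 2)          # (1, 513, 200)
wav = torchaudio.transforms.GriffinLim(n_fft=1024)(lin)   # (1, samples)
```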


Software Tools at NII for End-to-End Speech Synthesis: A Tutorial on Tacotron and WaveNet (Part 2)

#artificialintelligence

The tutorial covers Tacotron, Tacotron 2, the authors' Japanese Tacotron, and implementation details, and contrasts two TTS architectures. The traditional pipeline architecture for statistical parametric speech synthesis consists of task-specific models: a linguistic model, an alignment (duration) model, an acoustic model, and a vocoder. An end-to-end model, by contrast, converts text directly to waveform and does not require intermediate feature extraction; whereas pipeline models accumulate errors across predicted features, an end-to-end model's internal blocks are jointly optimized [1]. In the end-to-end model, decoding uses an attention mechanism: the previous output is fed back to the next time step, and alignment probabilities are assigned to the encoded inputs. [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech.
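
A minimal sketch of one such attention-based decoding step, with the previous output frame fed back and alignment probabilities computed over the encoded inputs, is given below; the dot-product attention form and all dimensions are illustrative assumptions, not the tutorial's code.

```python
# Sketch of one attention decoding step: attend over the encoded text using
# the decoder state, then predict the next frame from the previous frame and
# the attention context. The predicted frame is fed back at the next step.
import torch
import torch.nn as nn

ENC_DIM, MEL_DIM, HIDDEN = 256, 80, 256      # illustrative sizes

class AttentionDecoderStep(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(MEL_DIM + ENC_DIM, HIDDEN)
        self.query = nn.Linear(HIDDEN, ENC_DIM)
        self.out = nn.Linear(HIDDEN, MEL_DIM)

    def forward(self, enc, prev_frame, h):
        # enc: (B, T_in, ENC_DIM) encoded inputs, prev_frame: (B, MEL_DIM)
        scores = torch.bmm(enc, self.query(h).unsqueeze(-1)).squeeze(-1)  # (B, T_in)
        align = torch.softmax(scores, dim=-1)         # alignment probabilities
        context = torch.bmm(align.unsqueeze(1), enc).squeeze(1)           # (B, ENC_DIM)
        h = self.rnn(torch.cat([prev_frame, context], dim=-1), h)
        return self.out(h), h, align

step = AttentionDecoderStep()
enc = torch.randn(2, 40, ENC_DIM)
frame, h = torch.zeros(2, MEL_DIM), torch.zeros(2, HIDDEN)
for t in range(5):                            # a few decoding steps
    frame, h, align = step(enc, frame, h)     # frame is fed back at t + 1
```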


Columbia Researchers Developed Technology That Can Translate Brain Activity Into Words

#artificialintelligence

Neuroengineers from Columbia University reportedly developed a way to use artificial intelligence and speech synthesizers to translate one's thoughts into words. The team's paper, published Tuesday in Scientific Reports, outlines how they created a system that can read the brain activity of a listening patient, then reproduce what that patient heard with a clarity never before seen from such technology. The breakthrough opens the path for speech neuroprosthetics, or implants, to communicate directly with the brain. Ideally, the technology will someday allow those who have lost their ability to speak to regain a voice. This can help patients who have suffered a stroke or are living with amyotrophic lateral sclerosis (ALS) communicate more easily with loved ones.


Here's what AI experts think will happen in 2019

#artificialintelligence

Another year has passed and humanity, for better or worse, remains in charge of the planet. Unfortunately for the robots, TNW has it on good authority they won't take over next year either. In the meantime, here's what the experts think will happen in 2019. Dialpad, an AI startup created by the original founders of Google Voice, tells TNW that all the hype over robot assistants that can make calls on your behalf may be a bit premature. Etienne Manderscheid, the company's VP of AI and Machine Learning, says "robots may attempt to sound human next year, but this will work for few domains in 2019." Despite the hype brought on by Google Duplex and the resulting conversations around speech synthesis, true text-to-speech technology will not be able to carry on conversations outside of the specific domains it is built around for at least another few years.