Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
ABSTRACT: We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. The conversion is based on learning a polyglot network that has multiple per-language sub-networks and adding loss terms that preserve the speaker's identity in multiple languages. We evaluate the proposed polyglot neural network for three languages with a total of more than 400 speakers and demonstrate convincing conversion capabilities. Index Terms: TTS, multilingual, unsupervised learning. 1. INTRODUCTION: Neural text to speech (TTS) is an emerging technology that is becoming dominant over the alternative TTS technologies, in both quality and flexibility.
Neural networks have been used to turn words that a human has heard into intelligible, recognizable speech. It could be a step toward technology that can one day decode people's thoughts. A challenge: Thanks to fMRI scanning, we've known for decades that when people speak, or hear others, it activates specific parts of their brain. However, it's proved hugely challenging to translate thoughts into words. A team from Columbia University has developed a system that combines deep learning with a speech synthesizer to do just that.
Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation
TTS architecture: traditional pipeline
- Typical pipeline architecture for statistical parametric speech synthesis
- Consists of task-specific models: linguistic model, alignment (duration) model, acoustic model, vocoder
TTS architecture: End-to-end model
- End-to-end model directly converts text to waveform
- End-to-end model does not require intermediate feature extraction
- Pipeline models accumulate errors across predicted features
- End-to-end model's internal blocks are jointly optimized
Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech.
End-to-end model: Decoding with attention mechanism (time step 2)
- The previous output is fed back to the next time step
- Assign alignment probabilities to encoded inputs
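The attention decoding loop described in the slide notes above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: it uses dot-product scoring rather than the attention variant Tacotron actually uses, the "decoder" simply emits the context vector, and the names `attention_step`, `decode`, and `encoded_inputs` are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(encoded, query):
    """One decoder time step: alignment probabilities and context vector."""
    # Assumption: dot-product scoring stands in for a learned attention network.
    scores = [sum(q * h for q, h in zip(query, enc)) for enc in encoded]
    alignment = softmax(scores)  # one probability per encoded input
    dim = len(encoded[0])
    context = [sum(a * enc[d] for a, enc in zip(alignment, encoded))
               for d in range(dim)]
    return alignment, context

def decode(encoded, n_steps):
    """Run n_steps of decoding, feeding each output back as the next query."""
    prev_output = [0.0] * len(encoded[0])  # all-zero start frame
    alignments, outputs = [], []
    for _ in range(n_steps):
        alignment, context = attention_step(encoded, prev_output)
        prev_output = context  # toy decoder: the output is just the context
        alignments.append(alignment)
        outputs.append(prev_output)
    return alignments, outputs

# Hypothetical encoder outputs for a three-symbol input.
encoded_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alignments, outputs = decode(encoded_inputs, n_steps=3)
```

Each entry of `alignments` is a probability distribution over the encoded inputs. Because the start frame is all zeros, the first step attends uniformly; later steps shift weight according to the fed-back output.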
Neuroengineers from Columbia University reportedly developed a way to use artificial intelligence and speech synthesizers to translate one's thoughts into words. The team's paper, published Tuesday in Scientific Reports, outlines how they created a system that can read the brain activity of a listening patient, then reiterate what that patient heard with a clarity never before seen from such technology. The breakthrough opens the path for speech neuroprosthetics, or implants, to have direct communications with the brain. Ideally, the technology will someday allow those who have lost their ability to speak to regain a voice. This can help patients who have suffered a stroke or are living with amyotrophic lateral sclerosis (ALS) communicate more easily with loved ones.
Another year has passed and humanity, for better or worse, remains in charge of the planet. Unfortunately for the robots, TNW has it on good authority they won't take over next year either. In the meantime, here's what the experts think will happen in 2019: Dialpad, an AI startup created by the original founders of Google Voice, tells TNW that all the hype over robot assistants that can make calls on your behalf may be a bit premature. Etienne Manderscheid, the company's VP of AI and Machine Learning, says "robots may attempt to sound human next year, but this will work for few domains in 2019." Despite the hype brought on by Google Duplex and the resulting conversations around speech synthesis, true text-to-speech technology will not be able to carry on conversations outside the specific domains it is built around for at least another few years.
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
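To make the idea of a fixed-dimensional speaker embedding concrete, here is a minimal sketch of how reference utterances of different lengths can map to vectors of the same size and be compared. It rests on loud assumptions: mean pooling over frames stands in for the trained speaker encoder network, the frame features are made-up numbers, and the names `embed` and `cosine` are hypothetical rather than taken from the paper.

```python
import math

def embed(frames):
    """Map a variable-length list of frame vectors to one unit-length vector."""
    dim = len(frames[0])
    # Assumption: averaging frames replaces the learned speaker encoder.
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid dividing by zero
    return [x / norm for x in mean]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Made-up frame features: two utterances from speaker A, one from speaker B.
spk_a_short = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
spk_a_long = [[1.1, 0.0, 0.1], [0.8, 0.1, 0.0], [1.0, 0.2, 0.0], [0.9, 0.1, 0.1]]
spk_b = [[0.0, 1.0, 0.9], [0.1, 0.8, 1.0], [0.0, 0.9, 1.1]]

emb_a1, emb_a2, emb_b = embed(spk_a_short), embed(spk_a_long), embed(spk_b)
same_speaker = cosine(emb_a1, emb_a2)
different_speaker = cosine(emb_a1, emb_b)
```

In the system the abstract describes, a vector like `emb_a1` would condition the synthesis network. Here it only shows that the embedding size is independent of utterance length and that utterances from the same speaker score as more similar.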
A lightweight end-to-end acoustic system is crucial in the deployment of text-to-speech tasks. Finding one that produces good audio with low latency and few errors remains a problem. In this paper, we propose a new non-autoregressive, fully parallel acoustic system that utilizes a new attention structure and a recently proposed convolutional structure. Compared with the most popular end-to-end text-to-speech systems, our acoustic system can produce audio of equal or better quality with fewer errors and achieve at least a 10-fold inference speedup.
Amazon's Alexa continues to learn new party tricks, with the latest being a "newscaster style" speaking voice that will be launching on enabled devices in a few weeks' time. You can listen to samples of the speaking style below, and the results, well, they speak for themselves. The voice can't be mistaken for a human, but it does incorporate stresses into sentences in the same way you'd expect from a TV or radio newscaster. According to Amazon's own surveys, users prefer it to Alexa's regular speaking style when listening to articles (though getting news from smart speakers still has lots of other problems). Amazon says the new speaking style is enabled by the company's development of "neural text-to-speech" technology or NTTS.
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.
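The 500 kHz figure is a generation rate in audio samples per second, so it is easier to interpret once divided by the playback sample rate. The sketch below does that arithmetic. The 22,050 Hz playback rate is an assumption chosen for illustration (the abstract does not state it), and `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(samples_per_second, audio_sample_rate_hz):
    """How many seconds of audio are produced per second of compute."""
    return samples_per_second / audio_sample_rate_hz

GENERATION_RATE = 500_000        # samples per second, from the abstract ("500 kHz")
ASSUMED_SAMPLE_RATE_HZ = 22_050  # assumption: playback rate of the generated audio

rtf = real_time_factor(GENERATION_RATE, ASSUMED_SAMPLE_RATE_HZ)
print(f"about {rtf:.1f}x faster than real time")  # about 22.7x under this assumption
```

At a higher playback rate the factor shrinks proportionally, so the claim should always be read together with the sample rate used.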