Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
The most advanced wearable assistive technology device for the blind and visually impaired, that reads text, recognizes faces, identifies products and more. Intuitively responds to simple hand gestures. Real time identification of faces is seamlessly announced. Small, lightweight, and magnetically mounts onto virtually any eyeglass frame. Tiny, wireless, and does not require an internet connection.
Neural networks have been used to turn words that a human has heard into intelligible, recognizable speech. It could be a step toward technology that can one day decode people's thoughts. A challenge: Thanks to fMRI scanning, we've known for decades that when people speak, or hear others, it activates specific parts of their brain. However, it's proved hugely challenging to translate thoughts into words. A team from Columbia University has developed a system that combines deep learning with a speech synthesizer to do just that.
Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation

TTS architecture: traditional pipeline
- Typical pipeline architecture for statistical parametric speech synthesis
- Consists of task-specific models
  - linguistic model
  - alignment (duration) model
  - acoustic model
  - vocoder

TTS architecture: End-to-end model
- End-to-end model directly converts text to waveform
- End-to-end model does not require intermediate feature extraction
- Pipeline models accumulate errors across predicted features
- End-to-end model's internal blocks are jointly optimized

Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech.

End-to-end model: Decoding with attention mechanism (Time step 2)
- The previous output is fed back to next time step
- Assign alignment probabilities to encoded inputs
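The attention-based decoding described in the slide notes above (feed the previous output back in, then assign alignment probabilities to the encoded inputs) can be illustrated with a small sketch. This is a toy under stated assumptions, not the Tacotron implementation: the dot-product scoring, the all-zero start frame, and the names `softmax`, `attend`, and `decode` are illustrative choices, and a real system would use a learned scoring function and a recurrent decoder.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoded):
    """Assign an alignment probability to every encoded input.

    Assumption: plain dot-product scoring, chosen only for brevity.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    # Context vector: alignment-weighted average of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoded))
        for d in range(dim)
    ]
    return weights, context


def decode(encoded, steps):
    """Run a toy decoder loop and return the alignments for each step."""
    prev_output = [0.0] * len(encoded[0])  # hypothetical all-zero start frame
    alignments = []
    for _ in range(steps):
        weights, context = attend(prev_output, encoded)
        alignments.append(weights)
        # The previous output is fed back to the next time step. Here the
        # context vector stands in for the decoder output.
        prev_output = context
    return alignments


encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alignments = decode(encoded, steps=3)
for step, weights in enumerate(alignments, start=1):
    print(step, [round(w, 3) for w in weights])
```

At the first step the query is all zeros, so every encoded input gets the same probability. From the second step on, the fed-back output shifts the alignment toward the inputs it resembles most.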
Neuroengineers from Columbia University reportedly developed a way to use artificial intelligence and speech synthesizers to translate one's thoughts into words. The team's paper, published Tuesday in Scientific Reports, outlines how they created a system that can read the brain activity of a listening patient, then reiterate what that patient heard with a clarity never before seen from such technology. The breakthrough opens the path for speech neuroprosthetics, or implants, to have direct communications with the brain. Ideally, the technology will someday allow those who have lost their ability to speak to regain a voice. This can help patients who have suffered a stroke or are living with amyotrophic lateral sclerosis (ALS) communicate more easily with loved ones.
Another year has passed and humanity, for better or worse, remains in charge of the planet. Unfortunately for the robots, TNW has it on good authority they won't take over next year either. In the meantime, here's what the experts think will happen in 2019: Dialpad, an AI startup created by the original founders of Google Voice, tells TNW that all the hype over robot assistants that can make calls on your behalf may be a bit premature. Etienne Manderscheid, the company's VP of AI and Machine Learning, says "robots may attempt to sound human next year, but this will work for few domains in 2019." Despite the hype brought on by Google Duplex and the resulting conversations around speech synthesis, true text-to-speech technology will not be able to carry on conversations outside of the specific domains it is built around for at least another few years.
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
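The data flow among the three components in this abstract can be sketched as follows. Every function here is a hypothetical toy stand-in, not the paper's networks: the names, the dimensions (`dim`, `n_mels`, `hop`), and the arithmetic inside each function are assumptions chosen only to show what each stage consumes and produces.

```python
import math


def speaker_encoder(reference_audio, dim=4):
    """Toy stand-in: variable-length waveform -> fixed-size unit-norm vector."""
    chunks = [reference_audio[i::dim] for i in range(dim)]
    vec = [sum(c) / max(len(c), 1) for c in chunks]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def synthesizer(text, speaker_embedding, n_mels=3):
    """Toy stand-in: text + speaker embedding -> one 'mel frame' per character."""
    bias = sum(speaker_embedding)  # crude conditioning on the speaker
    return [[(ord(ch) % 7) + bias + m for m in range(n_mels)] for ch in text]


def vocoder(mel, hop=2):
    """Toy stand-in: mel frames -> time-domain samples (hop samples per frame)."""
    return [sum(frame) / len(frame) for frame in mel for _ in range(hop)]


# A few samples of reference speech from the target speaker (made-up values).
reference_audio = [0.1, -0.2, 0.3, 0.05, 0.4, -0.1]

embedding = speaker_encoder(reference_audio)  # stage 1
mel = synthesizer("hi", embedding)            # stage 2
waveform = vocoder(mel)                       # stage 3
print(len(embedding), len(mel), len(waveform))
```

The point of the sketch is the interface: the speaker encoder never sees the text, the synthesizer never sees the reference audio directly, and the vocoder sees only the spectrogram, which is what allows the three parts to be trained independently.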
A lightweight end-to-end acoustic system is crucial for deploying text-to-speech systems. Finding one that produces high-quality audio with low latency and few errors remains a problem. In this paper, we propose a new non-autoregressive, fully parallel acoustic system that utilizes a new attention structure and a recently proposed convolutional structure. Compared with the most popular end-to-end text-to-speech systems, our acoustic system produces audio of equal or better quality with fewer errors and achieves at least a 10-fold inference speedup.
Amazon's Alexa continues to learn new party tricks, with the latest being a "newscaster style" speaking voice that will be launching on enabled devices in a few weeks' time. You can listen to samples of the speaking style below, and the results, well, they speak for themselves. The voice can't be mistaken for a human, but it does incorporate stresses into sentences in the same way you'd expect from a TV or radio newscaster. According to Amazon's own surveys, users prefer it to Alexa's regular speaking style when listening to articles (though getting news from smart speakers still has lots of other problems). Amazon says the new speaking style is enabled by the company's development of "neural text-to-speech" technology or NTTS.
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.
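To put the 500 kHz figure in perspective, a generation rate can be converted into a real-time factor by dividing it by the sample rate of the audio being produced. The sketch below assumes 22,050 Hz audio, which the abstract does not state; the function name `realtime_factor` is likewise illustrative.

```python
def realtime_factor(generated_samples_per_sec, audio_sample_rate_hz):
    """Seconds of audio produced per second of compute."""
    if audio_sample_rate_hz <= 0:
        raise ValueError("audio sample rate must be positive")
    return generated_samples_per_sec / audio_sample_rate_hz


# 500 kHz generation rate from the abstract; 22,050 Hz is an assumed audio rate.
factor = realtime_factor(500_000, 22_050)
print(f"{factor:.1f}x faster than real time")  # 22.7x faster than real time
```

Under that assumption, one second of GPU time yields roughly 22.7 seconds of speech. A higher audio sample rate would lower the factor proportionally.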
ABSTRACT End-to-end speech synthesis is a promising approach that directly converts raw text to speech. Although it was shown that Tacotron2 outperforms classical pipeline systems with regard to naturalness in English, its applicability to other languages is still unknown. Japanese could be one of the most difficult languages for which to achieve end-to-end speech synthesis, largely due to its character diversity and pitch accents. Therefore, state-of-the-art systems are still based on a traditional pipeline framework that requires a separate text analyzer and duration model. Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical pipeline systems under various conditions to show their pros and cons. In a large-scale listening test, we investigated the impacts of the presence of accentual-type labels, the use of forced or predicted alignments, and acoustic features used as local condition parameters of the WaveNet vocoder. Our results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they are important stepping stones towards end-to-end Japanese speech synthesis.
Index Terms: speech synthesis, deep learning, Tacotron
1. INTRODUCTION Tacotron opened a novel path to end-to-end speech synthesis.