
Speech Synthesis

Text-to-Speech: One Small Step by Mankind to Create Lifelike Robots


While speech synthesis has come a long way since Kratzenstein's vowel organ, which could produce the five vowel sounds, transforming text into natural-sounding speech is a whole other level of challenge. Recent developments in deep learning have provided us with a new approach to the challenge. In this article, we shall briefly introduce a mainstream text-to-speech method from before the deep learning era, then explore models like WaveNet, which Google's text-to-speech API service now uses for lifelike speech synthesis. If you pause and think for a moment about how you would perform text-to-speech, you would probably formulate a method very similar to the concatenative approach. In concatenative text-to-speech, text is broken down into smaller units such as phonemes, and the corresponding recordings of those units are then combined to form complete speech.
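As a rough illustration of the concatenative idea described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `LEXICON` and `UNIT_BANK` tables are hypothetical, sine tones stand in for real unit recordings, and the linear crossfade at the joins is just one simple way to smooth boundaries, not something the article prescribes.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)


def tone(freq_hz, duration_ms=100):
    """Stand-in for a recorded unit: a short sine tone (not real speech)."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: phoneme label -> "recording" of that unit.
UNIT_BANK = {"HH": tone(200), "AH": tone(300), "L": tone(400), "OW": tone(500)}

# Hypothetical pronunciation lexicon: word -> phoneme sequence.
LEXICON = {"hello": ["HH", "AH", "L", "OW"]}


def text_to_units(text):
    """Break text into words, then into phoneme units via the lexicon."""
    units = []
    for word in text.lower().split():
        units.extend(LEXICON[word])
    return units


def crossfade_concat(a, b, overlap):
    """Join two clips, linearly crossfading over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head = a[:len(a) - overlap]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from a toward b
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + b[overlap:]


def synthesize(text, overlap=40):
    """Look up each unit's recording and stitch the recordings together."""
    waveform = []
    for unit in text_to_units(text):
        waveform = crossfade_concat(waveform, UNIT_BANK[unit], overlap)
    return waveform
```

Calling `synthesize("hello")` returns a list of float samples built from the four stand-in units. A real system would select among many recorded variants of each unit and handle words missing from the lexicon, both of which this sketch leaves out.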

NVIDIA Jarvis Conversational AI on Python


This lecture attempts to demystify conversational AI by covering its components, which include, but are not limited to, Automatic Speech Recognition, Natural Language Processing & Understanding, Text-to-Speech Synthesis, and Intention Extraction and Identification. We use NVIDIA's Jarvis, an application framework for multimodal conversational AI services that delivers real-time performance on GPUs, to perform sophisticated conversational AI tasks. By the end of the lecture, we present a Question/Answering Demo powered by NVIDIA's Jarvis. "True conversational AI is a voice assistant that can engage in human-like dialogue, capturing context and providing intelligent responses. Such AI models must be massive and highly complex," Sid Sharma, from 'What Is Conversational AI?'.

STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Artificial Intelligence

Previous works on expressive text-to-speech (TTS) are limited in robustness and speed during both training and inference. Such drawbacks mostly come from autoregressive decoding, which makes each step vulnerable to errors in the preceding ones. To overcome this weakness, we propose STYLER, a novel expressive text-to-speech model with a parallelized architecture. Removing autoregressive decoding and introducing speech decomposition for encoding makes speech synthesis more robust while retaining high style transfer performance. Moreover, our novel approach to modeling noise from audio, using domain adversarial training and Residual Decoding, enables style transfer without transferring noise. Our experiments demonstrate the naturalness and expressiveness of our model in comparison with other parallel TTS models. We also investigate our model's robustness and speed by comparison with an expressive TTS model that uses autoregressive decoding.

AdaSpeech: Adaptive Text to Speech for Custom Voice Artificial Intelligence

Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at
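The abstract above does not spell out how conditional layer normalization is computed. The sketch below shows one common formulation as an assumption, not as the authors' implementation: the scale and bias of a layer norm are produced from the speaker embedding by two small linear layers. All function and parameter names (`linear`, `conditional_layer_norm`, `w_scale`, `b_scale`, `w_bias`, `b_bias`) are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def linear(weights, bias, x):
    """Dense layer: one output per weight row, y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def conditional_layer_norm(hidden, speaker_emb,
                           w_scale, b_scale, w_bias, b_bias, eps=1e-5):
    """Layer norm whose scale and bias depend on a speaker embedding.

    Assumed formulation: normalize the hidden vector, then apply a
    scale and bias predicted from the speaker embedding by two linear
    layers. Only those two layers (plus the embedding) would need to
    be fine-tuned per speaker in this setup.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]

    scale = linear(w_scale, b_scale, speaker_emb)  # one scale per hidden dim
    bias = linear(w_bias, b_bias, speaker_emb)     # one bias per hidden dim
    return [s * n + b for s, n, b in zip(scale, normed, bias)]
```

Because the per-speaker parameters are limited to the two small projection layers and the embedding, a design like this keeps the adaptation footprint small, which is the trade-off the abstract describes.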

Microsoft opens limited access to its neural text-to-speech AI


Microsoft is opening up limited access to a text-to-speech AI called Custom Neural Voice, which allows developers to create custom synthetic voices. The tech is part of an Azure AI service called Speech. Companies can use the tech for things like voice-powered smart assistants and devices, chatbots, online learning and reading audiobooks or news. They'll have to apply for access and gain approval from Microsoft before they can harness Custom Neural Voice. The tech can deliver more natural-sounding voices than many other text-to-speech services, according to Microsoft.

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units Artificial Intelligence

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.

Text to speech, automation and AI: How Google is backing Middle East news providers


Google has awarded just under $2m to 21 projects in the Middle East, Turkey and Africa, following the first Google News Initiative (GNI) Innovation Challenge in the region. The move is part of a wider series of regional innovation challenges, and a global commitment from Google News to give $300m "to help journalism thrive in the digital age". A key focus for funding is "to support projects that drive digital innovation and develop new business models". Specifically in the Middle East, proposals were asked to focus on projects that "increase reader engagement and/or explore new business models to build a stronger future for journalism". Engagement was defined as a key metric, given that "engaged users are … more likely to convert to paid subscribers", while the focus on business models sought to encourage "moves which go beyond the traditional means to generate revenues".

The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units Artificial Intelligence

We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis Machine Learning

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper we investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English along with comparisons with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigated systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see if our findings can be generalized across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.

Corrective feedback, emphatic speech synthesis, visual-speech exaggeration, pronunciation learning Artificial Intelligence

To provide more discriminative feedback for second-language (L2) learners to better identify their mispronunciations, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration and enhancing the color Contrast. User studies show that exaggerated feedback outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.