Collaborating Authors

Speech Synthesis

Text to speech, automation and AI: How Google is backing Middle East news providers


Google has awarded just under $2m to 21 projects in the Middle East, Turkey and Africa, following the first Google News Initiative (GNI) Innovation Challenge in the region. The move is part of a wider series of regional innovation challenges, and a global commitment from Google News to give $300m "to help journalism thrive in the digital age". A key focus for funding is "to support projects that drive digital innovation and develop new business models". Specifically in the Middle East, proposals were asked to focus on projects that "increase reader engagement and/or explore new business models to build a stronger future for journalism". Engagement was defined as a key metric, given that "engaged users are … more likely to convert to paid subscribers", while the focus on business models sought to encourage "moves which go beyond the traditional means to generate revenues".

The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units Artificial Intelligence

We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis Machine Learning

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper we investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English along with comparisons with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigated systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see if our findings can be generalized across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has a room for improvement and needs to have an improved architecture to learn supra-segmental features more appropriately.

Corrective feedback, emphatic speech synthesis, visual-speech exaggeration, pronunciation learning Artificial Intelligence

To provide more discriminative feedback for the second language (L2) learners to better identify their mispronunciation, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing Amplitude of movement, extending the phone's Duration and enhancing the color Contrast. User studies show that exaggerated feedback outperforms non-exaggerated version on helping learners with pronunciation identification and pronunciation improvement.

Google Cloud lets businesses create their own text-to-speech voices – TechCrunch


Google launched a few updates to its Contact Center AI product today, but the most interesting one is probably the beta of its new Custom Voice service, which will let brands create their own text-to-speech voices to best represent their own brands. Maybe your company has a well-known spokesperson for example, but it would be pretty arduous to have them record every sentence in an automated response system or bring them back to the studio whenever you launch a new product or procedure. With Custom Voice, businesses can bring in their voice talent to the studio and have them record a script provided by Google. The company will then take those recordings and train its speech models based on them. As of now, this seems to be a somewhat manual task on Google's side.

Cloud Run: Google Cloud Text to Speech API


Google Cloud Run became Generally-Available (GA) around November of 2019. It provides a fully managed serverless execution platform to abstract infrastructure for stateless code deployment with HTTP-driven containers. Cloud Run is a Knative service utilizing the same APIs and runtime environments that make it possible to build container-based applications that can run anywhere, whether on Google cloud or Anthos deployed on-premises or on the Cloud. As a "serverless execution environment", Cloud Run can scale in response to the computing needs of the running application. Instant execution of application code, scalability and portability are core features of Cloud Run.

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis


Please click here to redirect to watch our video in Youtube. In this work, we propose a sequence-to-sequence architecture for accurate speech generation from silent lip videos in unconstrained settings for the first time. The text in the bubble is manually transcribed and is shown for presentation purposes. Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.

Text to Speech Technology: How Voice Computing is Building a More Accessible World


In a world where new technology emerges at exponential rates, and our daily lives are increasingly mediated by speakers and sound waves, text to speech technology is the latest force evolving the way we communicate. Text to speech technology refers to a field of computer science that enables the conversion of language text into audible speech. Also known as voice computing, text to speech (TTS) often involves building a database of recorded human speech to train a computer to produce sound waves that resemble the natural sound of a human speaking. This process is called speech synthesis. The technology is trailblazing and major breakthroughs in the field occur regularly.

r/MachineLearning - [2006.04558] FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech


Abstract: Advanced text-to-speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs during training and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of full end-to-end training and even faster inference than FastSpeech.

Facebook's voice synthesis AI generates speech in 500 milliseconds


Facebook today unveiled a highly efficient, AI text-to-speech (TTS) system that can be hosted in real time using regular processors. In tandem with a new data collection approach, which leverages a language model for curation, Facebook says the system -- which produces a second of audio in 500 milliseconds -- enabled it to create a British-accented voice in six months as opposed to over a year for previous voices. Most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google's tensor processing units (TPUs) to run, train, or both. For instance, a recently detailed Google AI system was trained across 32 TPUs in parallel. Synthesizing a single second of humanlike audio can require outputting as many as 24,000 samples -- sometimes even more.