Synthetic-speech researchers ... have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural.
– from Making Computers Talk. Andy Aaron, Ellen Eide and John F. Pitrelli. Scientific American Explore (March 17, 2003)
Speech synthesis (Text-to-speech, TTS) is the formation of a speech signal from printed text. In a way, it is the opposite of speech recognition. Speech synthesis is used in medicine, dialogue systems, voice assistants and many other business tasks. As long as we have one speaker, the task of speech synthesis at first glance looks quite clear. When several speakers come into play, the situation becomes somewhat complicated and other tasks come into play; for example, voice cloning and voice conversion, this will be discussed further in the text.
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both text embedding and speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that despite the lack of training data for the speaker, our method has achieved satisfactory results. It outperforms a recent zero-shot TTS engine by a large margin.
Conversational agents are a dialogue system through NLP to respond to a given query in human language. It leverages advanced deep learning measures and natural language understanding to reach a point where conversational agents can transcend simple chatbot responses and make them more contextual. Conversational AI encompasses three main areas of artificial intelligence research -- automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS or speech synthesis). These dialogue systems are utilised to read from the input channel and then reply with the relevant response in graphics, speech, or haptic-assisted physical gestures via the output channel. Modern conversational models often struggle when confronted with temporal relationships or disfluencies.The capability of temporal reasoning in dialogs in massive pre-trained language models like T5 and GPT-3 is still largely under-explored.
We describe our approach to create and deliver a custom voice for a conversational AI use-case. More specifically, we provide a voice for a Digital Einstein character, to enable human-computer interaction within the digital conversation experience. To create the voice which fits the context well, we first design a voice character and we produce the recordings which correspond to the desired speech attributes. We then model the voice. Our solution utilizes Fastspeech 2 for log-scaled mel-spectrogram prediction from phonemes and Parallel WaveGAN to generate the waveforms. The system supports a character input and gives a speech waveform at the output. We use a custom dictionary for selected words to ensure their proper pronunciation. Our proposed cloud architecture enables for fast voice delivery, making it possible to talk to the digital version of Albert Einstein in real-time.
TikTok's text-to-speech feature allows creators to put text over their videos and have a Siri-like voice read it out loud. It's a helpful way to annotate your videos to help describe what's happening, add context, or to serve whatever purpose you see fit. There's also no rule saying you can't use it just to make the text-to-speech voice say silly things. Here's how you can easily add text-to-speech to your TikTok videos. You can cancel it, edit the text, or adjust the duration of the text just by tapping the text again. Once you're happy with your video, just click "Next," apply whatever hashtags you want, and post!
We're a couple of decades into the 21st century, cars are literally starting to fly, a vacation to space is just around the corner ... and yet somehow, computers still sound like parodies of confused robots whenever asked to convert text-to-speech (TTS). Come on, devs, there has to be a better solution. A firm called WellSaid Labs believes it has one, and it's getting a boost thanks to an oversubscribed Series A. "Plain and simple, WellSaid is the future of content creation for voice. This is why thousands of customers love using the product daily with off-the-charts bottom-up adoption. Matt and Michael have assembled a world-class team, and we couldn't be more thrilled to be a part of the WellSaid journey," says Cameron Borumand, General Partner at FUSE, which led the round.
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions.
Vocoders received renewed attention as main components in statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Even though there are vocoding techniques give almost accepted synthesized speech, their high computational complexity and irregular structures are still considered challenging concerns, which yield a variety of voice quality degradation. Therefore, this paper presents new techniques in a continuous vocoder, that is all features are continuous and presents a flexible speech synthesis system. First, a new continuous noise masking based on the phase distortion is proposed to eliminate the perceptual impact of the residual noise and letting an accurate reconstruction of noise characteristics. Second, we addressed the need of neural sequence to sequence modeling approach for the task of TTS based on recurrent networks. Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) are studied and applied to model continuous parameters for more natural-sounding like a human. The evaluation results proved that the proposed model achieves the state-of-the-art performance of the speech synthesis compared with the other traditional methods.
Recently, there has been an increasing interest in neural speech synthesis. While the deep neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to generate a more emotional and more expressive speech is becoming a new challenge to researchers due to the scarcity of high-quality emotion speech dataset and the lack of advanced emotional TTS model. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation. After that, we propose a simple but efficient architecture for emotional speech synthesis called EMSpeech. Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding. In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations. Finally, by showing a comparable performance in the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.
Xbox's June update is here, and Microsoft has detailed the latest software tweaks Xbox One and Xbox Series X/S users can look forward to trying out on their consoles. To start, the company has officially implemented the speech transcription and text-to-speech synthesis tools it started testing with Xbox Insiders back in May. Now that they're part of the Xbox operating system, you can find both features in the "ease of access" setting tab under the "game and chat transcription." With speech-to-text transcription, your Xbox will transcribe and display what your party says on an adjustable overlay. With text-to-speech, meanwhile, a synthetic voice will read anything you type into party chat.