Collaborating Authors: Moinet, Alexis


BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

arXiv.org Artificial Intelligence

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes"), followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
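To make the two-stage design described in the abstract concrete, the sketch below shows a minimal version of that pattern in PyTorch: an autoregressive Transformer that models discrete speechcodes conditioned on text, and a convolutional decoder that turns speechcodes into waveform samples chunk by chunk so playback can start before generation finishes. All module names, layer sizes, vocabulary sizes, and the chunking scheme are illustrative assumptions, not the actual BASE TTS implementation.

```python
# Minimal sketch of a BASE TTS-style two-stage pipeline (hypothetical sizes/names).
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive Transformer: text tokens -> discrete speechcodes."""
    def __init__(self, text_vocab=256, code_vocab=1024, d_model=512, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.code_emb = nn.Embedding(code_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, code_vocab)

    def forward(self, text_ids, code_ids):
        # Text embeddings act as conditioning "memory"; speechcodes are
        # predicted left-to-right under a causal mask.
        memory = self.text_emb(text_ids)
        tgt = self.code_emb(code_ids)
        sz = tgt.size(1)
        causal = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)  # logits over the next speechcode

class StreamableDecoder(nn.Module):
    """Convolution-based decoder: speechcodes -> waveform, chunk by chunk."""
    def __init__(self, code_vocab=1024, d_model=512, upsample=256):
        super().__init__()
        self.emb = nn.Embedding(code_vocab, d_model)
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(d_model, 1, kernel_size=upsample, stride=upsample),
        )

    def forward(self, codes):
        x = self.emb(codes).transpose(1, 2)   # (B, d_model, T)
        return self.net(x).squeeze(1)         # (B, T * upsample) waveform samples

# Usage: decode speechcodes in fixed-size chunks so audio can be emitted
# incrementally instead of waiting for the whole utterance.
lm, vocoder = SpeechcodeLM(), StreamableDecoder()
text = torch.randint(0, 256, (1, 20))
codes = torch.randint(0, 1024, (1, 64))       # stand-in for sampled speechcodes
logits = lm(text, codes)                      # next-token logits used for sampling
for chunk in codes.split(16, dim=1):
    audio = vocoder(chunk)                    # ~16 * 256 samples per chunk
```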


A Comparative Analysis of Pretrained Language Models for Text-to-Speech

arXiv.org Artificial Intelligence

State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.
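The comparison described here amounts to swapping different frozen PLMs under the same prosody-prediction head and measuring quality. The sketch below illustrates that setup with Hugging Face `transformers`; the target features (F0, duration, energy), the head architecture, and the specific model names are assumptions for illustration, not the models or features evaluated in the study.

```python
# Illustrative sketch: regress word/token-level prosody targets from frozen
# PLM embeddings and repeat the experiment across several PLMs.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ProsodyHead(nn.Module):
    """Small regression head on top of frozen PLM token embeddings."""
    def __init__(self, hidden_size, n_targets=3):   # e.g. F0, duration, energy
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_size, 256), nn.ReLU(),
                                  nn.Linear(256, n_targets))

    def forward(self, hidden_states):
        return self.proj(hidden_states)              # (B, T, n_targets)

def predict_prosody(plm_name, sentence):
    tokenizer = AutoTokenizer.from_pretrained(plm_name)
    plm = AutoModel.from_pretrained(plm_name).eval()
    head = ProsodyHead(plm.config.hidden_size)       # trained on prosody data in practice
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**inputs).last_hidden_state     # frozen PLM features
    return head(hidden)

# Comparing PLMs of different sizes on the same sentence.
for name in ["bert-base-uncased", "bert-large-uncased", "roberta-base"]:
    preds = predict_prosody(name, "The results were not what anyone expected.")
    print(name, preds.shape)
```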


Controllable Emphasis with zero data for text-to-speech

arXiv.org Artificial Intelligence

We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice.

A popular approach consists in recording a smaller dataset featuring the desired emphasis effect in addition to the main 'neutral' recordings, and having the model learn the particular prosody associated with the emphasized words (see [5, 6, 7, 8] for recent examples). We build one such model as our upper anchor, as detailed in section 2.1. While this technique works well for the speaker for which 'emphasis recordings' are available, it does not directly scale to new speakers or different languages. An alternative technique adopted with varying degrees of success consists in annotating...
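The core idea, lengthening the predicted durations of the emphasized word before synthesis, can be sketched in a few lines. The sketch below assumes a TTS front end that exposes per-phoneme duration predictions and a word-to-phoneme alignment; the scaling factor and data layout are illustrative, not values taken from the paper.

```python
# Minimal sketch of duration-based emphasis on top of a phoneme duration model.
from typing import List

def emphasize_word(durations: List[float],
                   word_to_phonemes: List[List[int]],
                   word_index: int,
                   scale: float = 1.5) -> List[float]:
    """Return a copy of the predicted phoneme durations (in frames) with the
    phonemes belonging to the emphasized word lengthened by `scale`."""
    out = list(durations)
    for ph in word_to_phonemes[word_index]:
        out[ph] = durations[ph] * scale
    return out

# Example: a 3-word sentence mapped onto 8 phonemes; emphasize the last word.
durations = [4.0, 5.0, 3.0, 4.0, 6.0, 7.0, 5.0, 6.0]
word_to_phonemes = [[0, 1], [2, 3], [4, 5, 6, 7]]
stretched = emphasize_word(durations, word_to_phonemes, word_index=2)
print(stretched)   # the last four phoneme durations are 1.5x longer
```

Because it only rescales the duration predictions the model already produces, this kind of control needs no emphasis recordings or annotations, which is what lets it scale across speakers and languages.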