BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Łajszczak, Mateusz, Cámbara, Guillermo, Li, Yang, Beyhan, Fatih, van Korlaar, Arent, Yang, Fan, Joly, Arnaud, Martín-Cortinas, Álvaro, Abbas, Ammar, Michalski, Adam, Moinet, Alexis, Karlapati, Sri, Muszyńska, Ewa, Guo, Haohan, Putrycz, Bartosz, Gambino, Soledad López, Yoo, Kayeon, Sokolova, Elena, Drugman, Thomas

Feb-15-2024, arXiv.org Artificial Intelligence

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
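The abstract mentions compressing speechcodes with byte-pair encoding. As a rough illustration of that idea only, the sketch below applies plain byte-pair merging to a sequence of integer codes. It is not the authors' tokenizer: the function names, the toy code sequence, and the choice of 1024 as the first free ID are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(codes, num_merges, first_new_id):
    """Greedily merge frequent pairs; `first_new_id` must exceed every input code."""
    seq = list(codes)
    merges = {}  # new_id -> (left, right)
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[new_id] = pair
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges, newest first, to recover the original codes."""
    for new_id in sorted(merges, reverse=True):
        left, right = merges[new_id]
        out = []
        for tok in seq:
            out.extend((left, right) if tok == new_id else (tok,))
        seq = out
    return seq


# Toy sequence of hypothetical speechcodes (values are made up).
codes = [5, 5, 7, 5, 5, 7, 5, 5, 7, 2]
compressed, merges = bpe_compress(codes, num_merges=10, first_new_id=1024)
restored = bpe_expand(compressed, merges)
```

In this toy setting the merged sequence is shorter than the input and expands back to it exactly, which is the property that makes such merging attractive for shortening the sequences an autoregressive model has to predict.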

arxiv preprint arxiv, representation, speechcode, (13 more...)

Country:
- Oceania > New Zealand (0.04)
- South America > Brazil
  - Rio de Janeiro > Rio de Janeiro (0.04)
- North America > United States
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - California > Santa Clara County
    - Sunnyvale (0.04)
- Europe
  - Austria (0.04)
  - Iceland (0.04)
  - United Kingdom > England (0.04)
  - France (0.04)
  - Czechia > Prague (0.04)
  - Northern Europe (0.04)
  - Hungary (0.04)
  - Spain > Galicia
    - Madrid (0.04)
  - Germany > Saxony
    - Dresden (0.04)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Middle East > Republic of Türkiye
    - Istanbul Province > Istanbul (0.04)
- Asia
  - Maldives (0.04)
  - Singapore (0.04)
  - South Korea > Incheon
    - Incheon (0.04)
  - Middle East
    - Jordan (0.04)
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
  - Japan > Honshū
    - Kansai > Osaka Prefecture > Osaka (0.04)
  - China > Shanghai
    - Shanghai (0.04)

Genre:
- Personal (1.00)
- Research Report > Experimental Study (0.92)

Industry:
- Media (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Vision > Optical Character Recognition (0.81)
  - Speech
    - Speech Synthesis (1.00)
    - Speech Recognition (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)