Collaborating Authors: Moinet, Alexis


BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

arXiv.org Artificial Intelligence

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes"), followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
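To make the two-stage design described in the abstract concrete, the sketch below shows a minimal version of that pattern in PyTorch: an autoregressive Transformer that models discrete speechcodes conditioned on text, and a convolutional decoder that turns speechcodes into waveform samples chunk by chunk so playback can start before generation finishes. All module names, layer sizes, vocabulary sizes, and the chunking scheme are illustrative assumptions, not the actual BASE TTS implementation.

```python
# Minimal sketch of a BASE TTS-style two-stage pipeline (hypothetical sizes/names).
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive Transformer: text tokens -> discrete speechcodes."""
    def __init__(self, text_vocab=256, code_vocab=1024, d_model=512, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.code_emb = nn.Embedding(code_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, code_vocab)

    def forward(self, text_ids, code_ids):
        # Text embeddings act as conditioning "memory"; speechcodes are
        # predicted left-to-right under a causal mask.
        memory = self.text_emb(text_ids)
        tgt = self.code_emb(code_ids)
        sz = tgt.size(1)
        causal = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)  # logits over the next speechcode

class StreamableDecoder(nn.Module):
    """Convolution-based decoder: speechcodes -> waveform, chunk by chunk."""
    def __init__(self, code_vocab=1024, d_model=512, upsample=256):
        super().__init__()
        self.emb = nn.Embedding(code_vocab, d_model)
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(d_model, 1, kernel_size=upsample, stride=upsample),
        )

    def forward(self, codes):
        x = self.emb(codes).transpose(1, 2)   # (B, d_model, T)
        return self.net(x).squeeze(1)         # (B, T * upsample) waveform samples

# Usage: decode speechcodes in fixed-size chunks so audio can be emitted
# incrementally instead of waiting for the whole utterance.
lm, vocoder = SpeechcodeLM(), StreamableDecoder()
text = torch.randint(0, 256, (1, 20))
codes = torch.randint(0, 1024, (1, 64))       # stand-in for sampled speechcodes
logits = lm(text, codes)                      # next-token logits used for sampling
for chunk in codes.split(16, dim=1):
    audio = vocoder(chunk)                    # ~16 * 256 samples per chunk
```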


A Comparative Analysis of Pretrained Language Models for Text-to-Speech

arXiv.org Artificial Intelligence

State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.
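The comparison described here amounts to swapping different frozen PLMs under the same prosody-prediction head and measuring quality. The sketch below illustrates that setup with Hugging Face `transformers`; the target features (F0, duration, energy), the head architecture, and the specific model names are assumptions for illustration, not the models or features evaluated in the study.

```python
# Illustrative sketch: regress word/token-level prosody targets from frozen
# PLM embeddings and repeat the experiment across several PLMs.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ProsodyHead(nn.Module):
    """Small regression head on top of frozen PLM token embeddings."""
    def __init__(self, hidden_size, n_targets=3):   # e.g. F0, duration, energy
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_size, 256), nn.ReLU(),
                                  nn.Linear(256, n_targets))

    def forward(self, hidden_states):
        return self.proj(hidden_states)              # (B, T, n_targets)

def predict_prosody(plm_name, sentence):
    tokenizer = AutoTokenizer.from_pretrained(plm_name)
    plm = AutoModel.from_pretrained(plm_name).eval()
    head = ProsodyHead(plm.config.hidden_size)       # trained on prosody data in practice
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**inputs).last_hidden_state     # frozen PLM features
    return head(hidden)

# Comparing PLMs of different sizes on the same sentence.
for name in ["bert-base-uncased", "bert-large-uncased", "roberta-base"]:
    preds = predict_prosody(name, "The results were not what anyone expected.")
    print(name, preds.shape)
```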


Controllable Emphasis with zero data for text-to-speech

arXiv.org Artificial Intelligence

We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice.

A popular approach consists in recording a smaller dataset featuring the desired emphasis effect in addition to the main 'neutral' recordings, and having the model learn the particular prosody associated with the emphasized words (see [5, 6, 7, 8] for recent examples). We build one such model as our upper anchor, as detailed in section 2.1. While this technique works well for the speaker for which 'emphasis recordings' are available, it does not directly scale to new speakers or different languages. An alternative technique adopted with varying degrees of success consists in annotating...
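The core idea, lengthening the predicted durations of the emphasized word before synthesis, can be sketched in a few lines. The sketch below assumes a TTS front end that exposes per-phoneme duration predictions and a word-to-phoneme alignment; the scaling factor and data layout are illustrative, not values taken from the paper.

```python
# Minimal sketch of duration-based emphasis on top of a phoneme duration model.
from typing import List

def emphasize_word(durations: List[float],
                   word_to_phonemes: List[List[int]],
                   word_index: int,
                   scale: float = 1.5) -> List[float]:
    """Return a copy of the predicted phoneme durations (in frames) with the
    phonemes belonging to the emphasized word lengthened by `scale`."""
    out = list(durations)
    for ph in word_to_phonemes[word_index]:
        out[ph] = durations[ph] * scale
    return out

# Example: a 3-word sentence mapped onto 8 phonemes; emphasize the last word.
durations = [4.0, 5.0, 3.0, 4.0, 6.0, 7.0, 5.0, 6.0]
word_to_phonemes = [[0, 1], [2, 3], [4, 5, 6, 7]]
stretched = emphasize_word(durations, word_to_phonemes, word_index=2)
print(stretched)   # the last four phoneme durations are 1.5x longer
```

Because it only rescales the duration predictions the model already produces, this kind of control needs no emphasis recordings or annotations, which is what lets it scale across speakers and languages.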