Lajszczak, Mateusz
Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations
Martín-Cortinas, Álvaro, Sáez-Trigueros, Daniel, Vallés-Pérez, Iván, Tura-Vecino, Biel, Biliński, Piotr, Lajszczak, Mateusz, Beringer, Grzegorz, Barra-Chicote, Roberto, Lorenzo-Trueba, Jaime
Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping, and speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) architecture that learns to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, yielding speaker-disentangled representations. Training LLMs for text-to-speech (TTS) on these speaker-disentangled codes lets the LLM generate the content and the style of the speech from the text alone, much as humans do, while the speaker identity is supplied by the decoder of the VC model. Results show that LLMs trained over speaker-disentangled self-supervised representations improve speaker similarity by 4.7pp over SOTA entangled representations and lower word error rate (WER) by 5.4pp. Furthermore, they achieve higher naturalness than the human recordings of the LibriTTS test-other dataset. Finally, we show that using an explicit reference embedding negatively impacts intelligibility (stability), with WER increasing by 14pp compared to a model that infers the style from text alone.
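As a rough illustration of the training setup this abstract describes, the sketch below pairs a toy content encoder with a small causal LM over its discrete codes. Every module shape, vocabulary size, and name here is an assumption made for illustration, not the paper's architecture, and the VC decoder that restores speaker identity is omitted.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # number of discrete self-supervised content codes (assumed)
TEXT_VOCAB = 256       # byte-level text vocabulary (assumed)
D_MODEL = 512

class ContentEncoder(nn.Module):
    """Stand-in for the VC encoder that maps audio to speaker-disentangled
    content codes; the real model separates transitory from stationary features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, D_MODEL)                 # 80-bin mel input (assumed)
        self.to_codes = nn.Linear(D_MODEL, CODEBOOK_SIZE)  # toy quantizer

    def forward(self, mel):                                # mel: (B, T, 80)
        return self.to_codes(self.proj(mel)).argmax(-1)    # codes: (B, T)

class CodeLM(nn.Module):
    """Causal LM that predicts content codes from text alone, so content and
    style come from the text while speaker identity is left to the VC decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + CODEBOOK_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)

    def forward(self, text_ids, code_ids):
        # Causal LM over [text ; codes]; codes are offset into a shared vocabulary.
        x = torch.cat([text_ids, TEXT_VOCAB + code_ids], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(self.embed(x), mask=mask)
        return self.head(h[:, text_ids.size(1) - 1 : -1])  # logits for each next code

# One illustrative training step: the LM learns to continue text with the
# encoder's speaker-free codes; the VC decoder (omitted) re-injects speaker ID.
encoder, lm = ContentEncoder(), CodeLM()
mel = torch.randn(2, 50, 80)                   # dummy audio features
text = torch.randint(0, TEXT_VOCAB, (2, 20))   # dummy text prompt
with torch.no_grad():
    codes = encoder(mel)
logits = lm(text, codes)
loss = nn.functional.cross_entropy(logits.reshape(-1, CODEBOOK_SIZE), codes.reshape(-1))
loss.backward()
```

At inference, the LM would generate codes from text alone, and the VC decoder would render those codes in the target speaker's voice, which is why no reference embedding needs to be fed to the LM.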
Controllable Emphasis with zero data for text-to-speech
Joly, Arnaud, Nicolis, Marco, Peterova, Ekaterina, Lombardi, Alessandro, Abbas, Ammar, van Korlaar, Arent, Hussain, Aman, Sharma, Parul, Moinet, Alexis, Lajszczak, Mateusz, Karanasou, Penny, Bonafonte, Antonio, Drugman, Thomas, Sokolova, Elena
We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
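The core trick in this abstract, lengthening the predicted phoneme durations of the emphasized word, is easy to sketch. The function below assumes a pipeline that exposes per-phoneme durations and word-to-phoneme spans, and the 1.5x scale factor is an illustrative choice, not the paper's tuned value.

```python
from typing import List, Tuple

EMPHASIS_SCALE = 1.5  # illustrative; in practice this would be tuned perceptually

def emphasize(durations: List[float],
              word_spans: List[Tuple[int, int]],
              emphasized_word: int) -> List[float]:
    """Scale the predicted durations of the phonemes belonging to the
    emphasized word, leaving the rest of the utterance untouched."""
    start, end = word_spans[emphasized_word]  # phoneme index range of the word
    return [d * EMPHASIS_SCALE if start <= i < end else d
            for i, d in enumerate(durations)]

# Example: a 3-word utterance, emphasizing the middle word.
durs = [5.0, 7.0, 4.0, 6.0, 6.0, 5.0, 8.0]        # frames per phoneme (dummy values)
spans = [(0, 2), (2, 5), (5, 7)]                  # phoneme spans per word
print(emphasize(durs, spans, emphasized_word=1))  # middle word's phonemes lengthened
```

Because the modification happens in the duration model's output rather than in the spectrogram, the acoustic model renders the lengthened phonemes natively, which is consistent with the naturalness gain reported over spectrogram modification.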