Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
arXiv.org Artificial Intelligence
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust, as the generated output can contain repeating words, missing words, and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain …

Despite their remarkable achievements, LLM-based TTS models suffer from attention errors resulting in mis-aligned speech, repeating and missing words, analogous to hallucinations [15, 16] exhibited by LLMs in the text domain. This issue becomes more prominent when the input text is challenging and contains repeating words. For certain inputs, the probabilistic autoregressive inference of LLM-based TTS models can result in looping or infinite silences [17]. This issue makes LLM-based TTS models unreliable for real-world applications.
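The attention ambiguity described above can be illustrated with a minimal sketch (not the paper's model): in plain dot-product cross-attention without positional information, two occurrences of the same text token share one key vector, so any decoder query assigns them identical attention weights and the alignment cannot distinguish them. All names and values below are illustrative assumptions.

```python
import numpy as np

# Toy dot-product cross-attention: one decoder query attends over the
# keys of four text tokens. Token "the" appears twice; with a shared
# embedding and no positional encoding, its two keys are identical.
rng = np.random.default_rng(0)
d = 8
tokens = ["the", "cat", "the", "dog"]
emb = {t: rng.normal(size=d) for t in set(tokens)}  # one embedding per type
keys = np.stack([emb[t] for t in tokens])           # shape (4, d)

query = rng.normal(size=d)                          # a single decoder timestep
scores = keys @ query / np.sqrt(d)                  # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum()     # softmax over tokens

# Positions 0 and 2 ("the") receive exactly the same weight, so the
# model has no signal for which occurrence it is currently synthesizing.
assert np.isclose(weights[0], weights[2])
print(weights)
```

Positional encodings break the exact tie in practice, but the scores for repeated tokens remain close, which is one intuition for why such inputs provoke looping and skipping in autoregressive inference.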
Jun-25-2024