Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
arXiv.org Artificial Intelligence
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust, as the generated output can contain repeating words, missing words, and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain …

Despite their remarkable achievements, LLM-based TTS models suffer from attention errors resulting in mis-aligned speech, repeating and missing words, analogous to hallucinations [15, 16] exhibited by LLMs in the text domain. This issue becomes more prominent when the input text is challenging and contains repeating words. For certain inputs, the probabilistic autoregressive inference of LLM-based TTS models can result in looping or infinite silences [17]. This issue makes LLM-based TTS models unreliable for real-world applications.
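The attention ambiguity described above can be illustrated with a minimal sketch (not the paper's model): in plain dot-product cross-attention without positional information, two occurrences of the same text token share one key vector, so any decoder query assigns them identical attention weights and the alignment cannot distinguish them. All names and values below are illustrative assumptions.

```python
import numpy as np

# Toy dot-product cross-attention: one decoder query attends over the
# keys of four text tokens. Token "the" appears twice; with a shared
# embedding and no positional encoding, its two keys are identical.
rng = np.random.default_rng(0)
d = 8
tokens = ["the", "cat", "the", "dog"]
emb = {t: rng.normal(size=d) for t in set(tokens)}  # one embedding per type
keys = np.stack([emb[t] for t in tokens])           # shape (4, d)

query = rng.normal(size=d)                          # a single decoder timestep
scores = keys @ query / np.sqrt(d)                  # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum()     # softmax over tokens

# Positions 0 and 2 ("the") receive exactly the same weight, so the
# model has no signal for which occurrence it is currently synthesizing.
assert np.isclose(weights[0], weights[2])
print(weights)
```

Positional encodings break the exact tie in practice, but the scores for repeated tokens remain close, which is one intuition for why such inputs provoke looping and skipping in autoregressive inference.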
Jun-25-2024