Autoregressive Speech Synthesis with Next-Distribution Prediction
Zhu, Xinfa, Tian, Wenjie, Xie, Lei
arXiv.org Artificial Intelligence
We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: https://zxf-icpc.github.io/kalle/
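The Kullback-Leibler constraint mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each speech frame is represented as a diagonal Gaussian (a mean and a log-variance per dimension) and that the loss is the KL divergence from the extracted target distribution to the predicted one, averaged over frames. The abstract specifies neither the parameterization nor the direction of the divergence, and the names `gaussian_kl` and `sequence_kl_loss` are hypothetical, not taken from the paper.

```python
import math


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p || q) for two diagonal Gaussians, summed over dimensions.

    Each argument is a sequence of floats of equal length. The diagonal
    Gaussian form is an assumption made for this sketch.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Per dimension: 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def sequence_kl_loss(target_frames, predicted_frames):
    """Mean per-frame KL(target || predicted) over one utterance.

    Each frame is a (mean, log_variance) pair. The choice of direction
    (target first) is an assumption, not something the abstract states.
    """
    if not target_frames or len(target_frames) != len(predicted_frames):
        raise ValueError("frame sequences must be non-empty and of equal length")
    per_frame = [
        gaussian_kl(t_mu, t_logvar, p_mu, p_logvar)
        for (t_mu, t_logvar), (p_mu, p_logvar) in zip(target_frames, predicted_frames)
    ]
    return sum(per_frame) / len(per_frame)


# Toy usage: two frames of two-dimensional distributions with unit variance.
target = [([0.0, 0.0], [0.0, 0.0]), ([1.0, -1.0], [0.0, 0.0])]
predicted = [([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [0.0, 0.0])]
loss = sequence_kl_loss(target, predicted)
```

In an autoregressive setting under these assumptions, the predicted parameters for frame t would be produced by the language model from the text and the preceding frames, and a loss of this kind would be minimized by gradient descent in a tensor framework rather than computed with plain Python lists.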
Dec-21-2024
- Country:
  - South America > Colombia
    - Meta Department > Villavicencio (0.04)
  - Oceania > Australia
    - Victoria > Melbourne (0.04)
    - Queensland > Brisbane (0.04)
  - North America
    - United States
      - Maryland > Baltimore (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Hawaii > Honolulu County
        - Honolulu (0.04)
    - Canada
      - Ontario > Toronto (0.04)
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
      - Alberta > Census Division No. 15
        - Improvement District No. 9 > Banff (0.04)
  - Europe
  - Asia
    - Taiwan > Taiwan Province
      - Taipei (0.04)
    - South Korea > Incheon
      - Incheon (0.04)
    - Japan > Honshū
      - Tōhoku > Iwate Prefecture > Morioka (0.04)
    - China > Shaanxi Province
      - Xi'an (0.04)
  - Africa > Rwanda
- Genre:
  - Research Report > New Finding (0.48)
- Technology: