Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Li, Bohan, Wang, Hankun, Zhang, Situo, Guo, Yiwei, Yu, Kai
arXiv.org Artificial Intelligence
Auto-regressive architectures, such as GPTs, are widely used in modern Text-to-Speech (TTS) systems. However, they incur substantial inference time, particularly because next-token prediction must run over lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
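The draft-and-verify idea behind speculative decoding can be illustrated with a minimal sketch. This is a generic toy, not VADUSA's actual algorithm: the abstract does not specify how the draft heads or the tolerance mechanism work, so the top-k acceptance rule, the function `speculative_step`, the `tolerance` parameter, and the toy model `toy_base_probs` are all assumptions made for illustration.

```python
VOCAB = 8  # hypothetical vocabulary size for the toy model


def toy_base_probs(seq):
    """Hypothetical base model: favors last+1, then last+2 (mod VOCAB)."""
    last = seq[-1] if seq else 0
    probs = [0.1 / (VOCAB - 2)] * VOCAB
    probs[(last + 1) % VOCAB] = 0.6
    probs[(last + 2) % VOCAB] = 0.3
    return probs


def speculative_step(base_probs_fn, draft_tokens, prefix, tolerance=1):
    """Verify drafted tokens against the base model, left to right.

    Assumed rule: a draft token is accepted if it ranks within the
    top `tolerance` tokens of the base model's distribution.
    Returns the accepted tokens plus one token chosen by the base model.
    Real systems score all draft positions in one parallel forward pass;
    the sequential loop here is only for readability.
    """
    accepted = []
    for tok in draft_tokens:
        probs = base_probs_fn(prefix + accepted)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if tok in ranked[:tolerance]:
            accepted.append(tok)
        else:
            # Rejected: fall back to the base model's top token and stop.
            accepted.append(ranked[0])
            return accepted
    # Every draft token was accepted: add one more token from the base model.
    probs = base_probs_fn(prefix + accepted)
    accepted.append(max(range(len(probs)), key=lambda i: probs[i]))
    return accepted


strict = speculative_step(toy_base_probs, [1, 3, 4], [0], tolerance=1)
loose = speculative_step(toy_base_probs, [1, 3, 4], [0], tolerance=2)
```

In this toy setting, `strict` is `[1, 2]` (the second draft token is rejected) while `loose` is `[1, 3, 4, 5]`, so a looser acceptance rule yields more tokens per verification step. Whether quality is preserved under such a rule depends on the model and is the kind of question the paper evaluates.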
Oct-29-2024
- Country:
  - Asia
    - China
      - Jiangsu Province (0.04)
      - Shanghai > Shanghai (0.05)
    - Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
  - North America > United States (0.57)
- Genre:
  - Research Report > New Finding (0.86)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Speech > Speech Synthesis (0.87)