Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Li, Bohan, Wang, Hankun, Zhang, Situo, Guo, Yiwei, Yu, Kai

Oct-29-2024–arXiv.org Artificial Intelligence

The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Oct-29-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.57)
- Asia
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
  - China
    - Shanghai > Shanghai (0.05)
    - Jiangsu Province (0.04)

Genre:
- Research Report > New Finding (0.86)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning (1.00)
  - Speech > Speech Synthesis (0.87)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found