TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen
arXiv.org Artificial Intelligence
Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, codec LMs often suffer from limited inference speed and stability due to their auto-regressive nature and the implicit alignment between text and audio. In this work, to address these challenges, we introduce a new variant of the neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and to reduce model size. In addition, a gated cross-attention layer is included in each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In evaluations on the LibriSpeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score than VALL-E, with 90% fewer parameters and a 5.2x speedup. Demo and code are available at https://ereboas.github.io/TacoLM/.
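To make the second component concrete: a gated cross-attention layer lets each decoder layer attend over the text encoding while an elementwise gate controls how much of the attended result enters the residual stream. The PyTorch sketch below is only a rough illustration of this idea, not the authors' implementation; the module name `GatedCrossAttention`, the sigmoid gating form, and all shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Hypothetical sketch: multi-head cross-attention whose output is
    scaled elementwise by a sigmoid gate computed from the query stream."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # per-dimension gate logits
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (batch, T_audio, d_model) -- acoustic-token decoder stream
        # memory: (batch, T_text,  d_model) -- encoded text/prompt stream
        attn_out, _ = self.attn(x, memory, memory, need_weights=False)
        g = torch.sigmoid(self.gate(x))       # gate values in (0, 1)
        return self.norm(x + g * attn_out)    # gated residual connection

# Toy usage: 2 utterances, 100 audio tokens attending over 30 text tokens.
layer = GatedCrossAttention(d_model=512, n_heads=8)
audio = torch.randn(2, 100, 512)
text = torch.randn(2, 30, 512)
out = layer(audio, text)                      # shape: (2, 100, 512)
```

Under these assumptions, the gate gives the model a learned, per-position way to suppress or admit text conditioning, which is one plausible route to the stability gains the abstract reports.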
Jun-22-2024