ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading
Xiao, Yujia, Zhang, Shaofei, Wang, Xi, Tan, Xu, He, Lei, Zhao, Sheng, Soong, Frank K., Lee, Tan
arXiv.org Artificial Intelligence
While state-of-the-art Text-to-Speech (TTS) systems can generate natural, high-quality speech at the sentence level, they still face major challenges in paragraph and long-form reading. These deficiencies stem from i) neglect of cross-sentence contextual information, and ii) the high computation and memory cost of long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. We then construct hierarchically structured textual semantics to broaden the scope of global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves voice quality and prosodic expressiveness in paragraph reading while remaining competitive in model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
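The abstract does not say which linearization ContextSpeech uses, so the following is only a minimal sketch of the general idea behind linearized self-attention, not the authors' implementation. It assumes the common kernel formulation with the feature map elu(x) + 1 and a non-causal setting; the names `feature_map`, `linear_attention`, and `quadratic_attention` are hypothetical. The point it illustrates is that summing key-value products once lets each query be answered without forming the full n-by-n weight matrix.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumed choice for this sketch)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Non-causal linearized attention: cost grows linearly with sequence length."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k_j) * v_j^T
    z = [0.0] * d                         # running sum of phi(k_j)
    for k, v in zip(keys, values):
        phi_k = [feature_map(x) for x in k]
        for a in range(d):
            z[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        denom = sum(phi_q[a] * z[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d)) / denom for b in range(d_v)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """Reference with the same kernel that builds every pairwise weight explicitly."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        weights = [
            sum(a * feature_map(b) for a, b in zip(phi_q, k)) for k in keys
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs
```

Both functions compute the same result; the linear version simply reorders the products, which is why its memory use does not depend on the square of the sequence length. A production model would use batched tensor operations rather than Python lists.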
Oct-7-2023
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (1.00)
- Natural Language (1.00)
- Speech > Speech Synthesis (0.75)
- Vision > Optical Character Recognition (0.62)