Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Jiang, Ziyue, Ren, Yi, Li, Ruiqi, Ji, Shengpeng, Ye, Zhenhui, Zhang, Chen, Jionghao, Bai, Yang, Xiaoda, Zuo, Jialong, Zhang, Yu, Liu, Rui, Yin, Xiang, Zhao, Zhou
–arXiv.org Artificial Intelligence
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces \textit{S-DiT}, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
arXiv.org Artificial Intelligence
Feb-26-2025
- Country:
- Asia (0.28)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Information Technology (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Large Language Model (1.00)
- Speech > Speech Synthesis (0.88)
- Vision (1.00)
- Information Technology > Artificial Intelligence