Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Zhenhui Ye, Chen Zhang, Jionghao Bai, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

arXiv.org Artificial Intelligence 

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues in speech-text alignment modeling: 1) models without explicit alignment modeling are less robust, especially on hard sentences in practical applications; 2) models based on predefined alignments are constrained in naturalness by forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides a latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate generation. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
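The abstract's core idea, providing sparse alignment boundaries rather than a full forced alignment, can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything below (the `MASK` token, the anchor-frame representation, the function name) is a hypothetical construction: only a few anchor frames reveal which phoneme is being spoken, and all remaining frames are left unconstrained for the model to resolve.

```python
import numpy as np

MASK = -1  # hypothetical token id meaning "no alignment information"


def sparse_alignment(phoneme_ids, anchor_frames, num_frames):
    """Build a frame-level conditioning sequence that exposes alignment
    only at sparse anchor frames (a hypothetical sketch, not the paper's
    actual encoding). Frames without an anchor stay masked, so the
    model's alignment search space is not restricted."""
    frames = np.full(num_frames, MASK, dtype=int)
    for phoneme, frame in zip(phoneme_ids, anchor_frames):
        frames[frame] = phoneme
    return frames


# Two phonemes anchored at frames 2 and 7 of a 10-frame utterance;
# the other 8 frames remain masked.
cond = sparse_alignment([5, 9], [2, 7], num_frames=10)
```

The intuition matches the abstract's claim: compared with a dense forced alignment, the masked frames leave the fine-grained timing to the model, which is what permits higher naturalness while the anchors still ease alignment learning.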
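The multi-condition classifier-free guidance used for accent intensity adjustment can also be sketched. The abstract gives no formulation, so the following is a rough illustration under common CFG conventions: the score estimate is pushed from the unconditional prediction toward the text-conditioned one, and then toward the text-plus-accent-conditioned one, with a separate weight per condition. The function name and weight names are assumptions, not the paper's notation.

```python
import numpy as np


def multi_condition_cfg(eps_uncond, eps_text, eps_text_accent,
                        w_text=2.0, w_accent=1.0):
    """Hypothetical two-condition classifier-free guidance.

    eps_uncond      : model output with all conditions dropped
    eps_text        : model output conditioned on text only
    eps_text_accent : model output conditioned on text and accent
    w_accent acts as an accent-intensity knob: 0 removes the accent
    guidance entirely, larger values strengthen it.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_accent * (eps_text_accent - eps_text))
```

Under this (assumed) formulation, sweeping `w_accent` at inference time yields the flexible accent-intensity control the abstract describes, without retraining the model.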