DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Park, Hyun Joon, Kim, Jin Sob, Shin, Wooseok, Han, Sung Won

Jun-27-2024–arXiv.org Artificial Intelligence

Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations contain the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in terms of objective and subjective evaluation in English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, the comparison results for the general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.

dataset, information, speech, (14 more...)

arXiv.org Artificial Intelligence

Jun-27-2024

arXiv.org PDF

Add feedback

Country:
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre:
- Research Report > New Finding (0.66)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Synthesis (1.00)
  - Machine Learning (1.00)
  - Vision > Optical Character Recognition (0.71)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found