HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Lee, Sang-Hoon, Choi, Ha-Yeong, Oh, Hyung-Seok, Lee, Seong-Whan

Jul-30-2023–arXiv.org Artificial Intelligence

Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.
