AdaptVC: High Quality Voice Conversion with Adaptive Learning

Kim, Jaehun, Kim, Ji-Hoon, Choi, Yeunju, Nguyen, Tan Dat, Mun, Seongkyu, Chung, Joon Son

Jan-14-2025–arXiv.org Artificial Intelligence

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches use various methods to isolate the two, generalization still requires further attention, especially robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
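The abstract names a conditional flow matching decoder but does not give its formulation. As a rough illustration only, the sketch below builds one training example for the commonly used straight-path (optimal-transport) variant of conditional flow matching. The function names, the `sigma_min` default, and the use of plain Python lists in place of mel-spectrogram tensors are all assumptions made for this example, not details taken from the paper. Speaker conditioning via cross-attention is not modeled here.

```python
import random

def cfm_training_pair(x1, sigma_min=1e-4, rng=random):
    """Build one conditional flow matching example for a data point x1.

    Hypothetical helper: returns (t, x_t, u_t), where x_t lies on the
    straight path from a noise sample x0 to x1, and u_t is the velocity
    a decoder network would be trained to predict at (t, x_t).
    """
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # standard Gaussian noise sample
    scale = 1.0 - (1.0 - sigma_min) * t       # shrinking weight on the noise
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    u_t = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return t, x_t, u_t

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)
```

In a full system, the prediction passed to `cfm_loss` would come from a network that also receives the content and speaker features; at inference time, the learned velocity field would be integrated from noise toward a spectrogram with an ODE solver.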


Country:
- Europe > Romania
  - Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Asia
  - South Korea (0.04)
  - Japan > Honshū
    - Kantō
      - Tokyo Metropolis Prefecture > Tokyo (0.04)
      - Kanagawa Prefecture (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (1.00)
  - Machine Learning (1.00)
  - Natural Language > Large Language Model (0.70)
