SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis

Wang, Zihao, Ma, Le, Liu, Yan, Zhang, Kejun

Jul-10-2024–arXiv.org Artificial Intelligence

Singing voice conversion (SVC) aims to convert a singer's voice in a given piece of music to that of another singer while preserving the original content. We propose SaMoye, an end-to-end model based on feature disentanglement, to enable zero-shot many-to-many singing voice conversion. SaMoye disentangles the features of the singing voice into content features, timbre features, and pitch features. The content features are enhanced using a GPT-based model that performs cross-prediction with the phonemes of the lyrics. SaMoye generates the music with the converted voice by replacing the timbre features with those of the target singer. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance. The dataset consists of 1,500k pure singing vocal clips from at least 10,000 singers.
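
The timbre-replacement step described above can be illustrated with a minimal sketch. This is not the SaMoye implementation: the `VoiceFeatures` container, the `swap_timbre` function, and the toy numeric values are all hypothetical, and the encoders, the GPT-based content enhancement, and the synthesizer are omitted entirely. It only shows the idea that, once the three feature groups are separated, conversion keeps the source content and pitch and takes the timbre from the target singer.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class VoiceFeatures:
    """Hypothetical container for the three disentangled feature groups."""
    content: Tuple[float, ...]  # what is sung (placeholder values, per frame)
    pitch: Tuple[float, ...]    # melody (placeholder F0 values in Hz, per frame)
    timbre: Tuple[float, ...]   # singer identity (placeholder embedding)


def swap_timbre(source: VoiceFeatures, target: VoiceFeatures) -> VoiceFeatures:
    """Keep the source content and pitch; take the timbre from the target."""
    return replace(source, timbre=target.timbre)


# Toy values only; a real system would obtain these from learned encoders.
source = VoiceFeatures(content=(0.1, 0.2), pitch=(220.0, 247.0), timbre=(1.0, 0.0))
target = VoiceFeatures(content=(0.9,), pitch=(330.0,), timbre=(0.0, 1.0))

converted = swap_timbre(source, target)
```

In a full system, the resulting features would then be passed to a synthesis module to produce audio; that stage is outside the scope of this sketch.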

feature disentanglement and synthesis, voice conversion, zero-shot singing voice conversion, (9 more...)

Country:
- Asia > China
  - Zhejiang Province > Hangzhou (0.04)
  - Beijing > Beijing (0.04)

Genre:
- Research Report (0.41)

Industry:
- Media > Music (0.47)
- Leisure & Entertainment (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Speech (0.98)
  - Natural Language > Large Language Model (0.86)
