SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations

Youngjun Sim, Jinsung Yoon, Young-Joo Suh

arXiv.org Artificial Intelligence 

One-shot voice conversion converts the speaker identity of a source into that of an arbitrary target using only a single utterance. This process typically employs disentanglement-based methods to separate content and speaker information, replacing the source speaker's information with that of the target speaker. The key challenge lies in effectively disentangling content and speaker information while preserving both. To address this, various strategies have been proposed, including information bottlenecks [1, 2], additional loss functions [3, 4], normalization techniques [5, 6], and vector quantization (VQ) methods [7-9]. VQ methods capture content information by replacing the input embedding with the nearest vectors from a discrete codebook, which primarily represents phonetic features within the continuous content space, thus removing speaker information.

FreeVC [12] captures content information using SSL features, combined with data perturbation, a bottleneck network, and a conditional normalizing flow method, while employing an external speaker embedding to achieve high naturalness and similarity in voice conversion.

SSL features, which are speech representations derived from self-supervised learning (SSL) models such as HuBERT [14] and WavLM [15], have demonstrated the ability to linearly predict various speech attributes [16]. These features are encoded such that instances of the same phone are closer together than different phones, meaning that nearby features share similar phonetic content [17, 18]. Due to this inherent characteristic, SSL features have been increasingly used in voice conversion.
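To make the notion of SSL features concrete, the sketch below extracts frame-level WavLM representations with torchaudio's pretrained bundle and compares two frames by cosine similarity. The file path `utterance.wav`, the layer index, and the frame indices are illustrative assumptions, not choices from the paper; given phone alignments, frames of the same phone would be expected to score higher than frames of different phones [17, 18].

```python
import torch
import torchaudio

# Pretrained WavLM bundle from torchaudio (downloaded on first use);
# the HuBERT bundle (HUBERT_BASE) works the same way.
bundle = torchaudio.pipelines.WAVLM_BASE
model = bundle.get_model().eval()

# "utterance.wav" is a placeholder path for any speech recording.
waveform, sr = torchaudio.load("utterance.wav")
waveform = waveform.mean(0, keepdim=True)  # collapse to mono, batch of 1
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one (batch, frames, dim) tensor per layer.
    layer_feats, _ = model.extract_features(waveform)

feats = layer_feats[6].squeeze(0)  # (T, 768); layer choice is illustrative

# Cosine similarity between two frames: a high value suggests the frames
# carry similar phonetic content.
sim = torch.nn.functional.cosine_similarity(feats[10], feats[11], dim=0)
print(f"frame similarity: {sim.item():.3f}")
```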
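Since the title points to k-means quantization, the following minimal sketch shows the codebook mechanics described in the VQ discussion above: centroids fitted with k-means serve as the discrete codebook, and each frame is replaced by its nearest centroid. The feature dimension, cluster count, and random "features" are placeholder assumptions standing in for real SSL representations, not the authors' configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder corpus of SSL-like frame features: (frames, dim). In practice
# these would come from a model such as HuBERT or WavLM.
corpus = rng.standard_normal((5000, 768)).astype(np.float32)

# Fit a discrete codebook: K centroids in the continuous content space.
K = 64  # illustrative codebook size
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(corpus)
codebook = kmeans.cluster_centers_  # (K, 768)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Replace each frame with its nearest codebook vector (L2 distance)."""
    # Squared distance between every frame and every codebook entry;
    # equivalent to codebook[kmeans.predict(frames)].
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[d2.argmin(axis=1)]

utterance = rng.standard_normal((200, 768)).astype(np.float32)
quantized = quantize(utterance)
print(quantized.shape)  # (200, 768): same shape, but only K distinct vectors
```

Because each frame collapses onto the closest of K shared centroids, per-frame nuances such as speaker timbre are discarded while the dominant phonetic identity is kept, which is the disentanglement effect the VQ discussion attributes to codebook replacement.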