Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities

He, Haorui, Song, Yuchen, Wang, Yuancheng, Li, Haoyang, Zhang, Xueyao, Wang, Li, Huang, Gongping, Chng, Eng Siong, Wu, Zhizheng

Nov-29-2024–arXiv.org Artificial Intelligence

One-shot voice conversion (VC) aims to alter the timbre of speech from a source speaker to match that of a target speaker using just a single reference speech from the target, while preserving the semantic content of the original source speech. Despite advancements in one-shot VC, its effectiveness decreases in real-world scenarios where reference speeches, often sourced from the internet, contain various disturbances like background noise. To address this issue, we introduce Noro, a Noise Robust One-shot VC system. Noro features innovative components tailored for VC using noisy reference speeches, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experimental results demonstrate that Noro outperforms our baseline system in both clean and noisy scenarios, highlighting its efficacy for real-world applications. Additionally, we investigate the hidden speaker representation capabilities of our baseline system by repurposing its reference encoder as a speaker encoder. The results shows that it is competitive with several advanced self-supervised learning models for speaker representation under the SUPERB settings, highlighting the potential for advancing speaker representation learning through one-shot VC task.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Nov-29-2024

arXiv.org PDF

Add feedback

Country:
- Asia > China (0.46)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
  - Natural Language (1.00)
  - Speech (1.00)