RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding

Liu, Yisi, Wang, Chenyang, Kim, Hanjo, Khan, Raniya, Anumanchipalli, Gopala

Jun-13-2025–arXiv.org Artificial Intelligence

Voice conversion has emerged as a pivotal technology in numerous applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that, while maintaining synthesis quality comparable to the current state-of-the-art (SOTA) method, RT-VC achieves a CPU latency of 61.4 ms, representing a 13.3\% reduction in latency.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Jun-13-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (1.00)
  - Machine Learning > Neural Networks (1.00)
  - Natural Language > Large Language Model (0.72)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found