Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning
Junchuan Zhao, Xintong Wang, Ye Wang
arXiv.org Artificial Intelligence
Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALL-E X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.
Sep-30-2025
- Country:
- Asia > Singapore > Central Region > Singapore (0.04)
- Genre:
- Research Report > New Finding (0.66)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.94)
- Natural Language (1.00)
- Speech > Speech Recognition (0.94)