FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Kaneko, Takuhiro, Kameoka, Hirokazu, Tanaka, Kou, Kondo, Yuto
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.
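The ADCD objective described above combines an adversarial term with a score-distillation term. A minimal sketch of how such a combined loss could be formed is shown below; the function name, the MSE form of the score-distillation term, the non-saturating adversarial term, and the weighting parameters are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def adcd_style_loss(student_out, teacher_estimate, disc_logits,
                    lambda_adv=1.0, lambda_sd=1.0):
    """Illustrative combination of adversarial and score-distillation losses.

    student_out:      output of the one-step student generator (hypothetical features)
    teacher_estimate: the multi-step teacher diffusion model's estimate
    disc_logits:      discriminator logits on the student output
    """
    # Score-distillation term: pull the student toward the teacher's
    # estimate (plain MSE here is an assumption for illustration).
    loss_sd = np.mean((student_out - teacher_estimate) ** 2)
    # Non-saturating adversarial term for the generator: softplus(-logit).
    loss_adv = np.mean(np.logaddexp(0.0, -disc_logits))
    # Weighted sum of the two terms.
    return lambda_adv * loss_adv + lambda_sd * loss_sd
```

When the student exactly matches the teacher, the score-distillation term vanishes and only the adversarial term remains, which is what drives further realism beyond pure imitation.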
Aug-26-2025