CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Mar-22-2026, 04:38:13 GMT–Neural Information Processing Systems

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model.
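The three stages named in the abstract can be pictured as a simple dataflow. The sketch below is a toy illustration only, not the authors' implementation. Every function name, the `SILENCE` padding token, the word-hash "tokens", and the closed-form velocity field are assumptions made for this example. A real system would use trained models at each stage, with HiFi-GAN in place of the stub vocoder.

```python
import random
from typing import Dict, List, Tuple

SILENCE = 0  # hypothetical padding token for a talker who is not speaking


def text_to_token_streams(dialogue: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Stage 1 stand-in: one time-aligned token stream per talker."""
    talkers = sorted({spk for spk, _ in dialogue})
    streams: Dict[str, List[int]] = {spk: [] for spk in talkers}
    for spk, utterance in dialogue:
        # Toy "semantic tokens": one integer in 1..99 per word (not a real tokenizer).
        tokens = [1 + (sum(map(ord, w)) % 99) for w in utterance.split()]
        for t in talkers:
            streams[t].extend(tokens if t == spk else [SILENCE] * len(tokens))
    return streams


def acoustic_model(streams: Dict[str, List[int]], n_mels: int = 4,
                   steps: int = 8, seed: int = 0) -> List[List[float]]:
    """Stage 2 stand-in: Euler integration of a velocity field from noise
    to a single mixed 'mel-spectrogram' (frames x mel bins)."""
    rng = random.Random(seed)
    n_frames = len(next(iter(streams.values())))
    # Toy conditioning target: per-frame sum of all talkers' tokens, scaled.
    target = [[sum(s[i] for s in streams.values()) / 100.0] * n_mels
              for i in range(n_frames)]
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_mels)] for _ in range(n_frames)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Closed-form velocity of the straight path x_t = (1 - t) x_0 + t x_1.
        # A trained flow-matching model would predict this velocity instead.
        x = [[xi + dt * (ti - xi) / (1.0 - t) for xi, ti in zip(xr, tr)]
             for xr, tr in zip(x, target)]
    return x


def vocoder(mel: List[List[float]], hop: int = 2) -> List[float]:
    """Stage 3 stand-in for a neural vocoder: repeat each frame's mean `hop` times."""
    return [sum(frame) / len(frame) for frame in mel for _ in range(hop)]


dialogue = [("A", "hello there"), ("B", "hi")]
streams = text_to_token_streams(dialogue)  # two streams of equal length
mel = acoustic_model(streams)              # one mixed spectrogram for both talkers
wave = vocoder(mel)                        # flat list of toy "samples"
```

The point of the sketch is the shape of the data. Each talker keeps its own token stream, but the acoustic stage emits a single mixed spectrogram rather than one per talker.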

artificial intelligence, large language model, natural language

Genre:
- Research Report (0.39)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (0.59)
  - Natural Language > Large Language Model (0.53)