MoonCast: High-Quality Zero-Shot Podcast Generation

Ju, Zeqian, Yang, Dongchao, Yu, Jianwei, Shen, Kai, Leng, Yichong, Wang, Zhengtao, Tan, Xu, Zhou, Xinyu, Qin, Tao, Li, Xiangyang

Mar-19-2025–arXiv.org Artificial Intelligence

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Mar-19-2025

arXiv.org PDF

Add feedback

Country:
- Asia > China (0.04)
- Europe > Italy
  - Calabria > Catanzaro Province > Catanzaro (0.04)

Genre:
- Research Report > Promising Solution (0.34)

Industry:
- Information Technology (0.93)

Technology:
- Information Technology
  - Communications > Mobile (1.00)
  - Artificial Intelligence
    - Speech > Speech Recognition (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found