AITopics | spontaneity

MoonCast: High-Quality Zero-Shot Podcast Generation

Neural Information Processing Systems | Jun-14-2026, 13:35:12 GMT

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize spontaneous podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To enable long audio generation, we employ a language model with parameter, data, and context scaling to process sequences in an innovative format designed for modeling entire multi-speaker, multi-turn speech interactions. To enhance spontaneity, we observe that ASR transcripts capture spontaneous speech details (e.g., filler words indicating hesitations, and specific punctuation and spaces reflecting breathing pauses), suggesting that these transcripts can serve as a partial indicator of speech spontaneity. Building upon this assumption, we utilize a script generation module to generate scripts incorporating these spontaneous elements. Experiments show MoonCast outperforms baselines, with notable improvements in contextual coherence and spontaneity.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

MoonCast: High-Quality Zero-Shot Podcast Generation

Neural Information Processing Systems | Jun-10-2026, 01:36:54 GMT

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize spontaneous podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To enable long audio generation, we employ a language model with parameter, data, and context scaling to process sequences in an innovative format designed for modeling entire multi-speaker, multi-turn speech interactions. To enhance spontaneity, we observe that ASR transcripts capture spontaneous speech details (e.g., filler words indicating hesitations, and specific punctuation and spaces reflecting breathing pauses), suggesting that these transcripts can serve as a partial indicator of speech spontaneity. Building upon this assumption, we utilize a script generation module to generate scripts incorporating these spontaneous elements. Experiments show MoonCast outperforms baselines, with notable improvements in contextual coherence and spontaneity.

artificial intelligence, natural language, proceedings, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.62)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.59)

Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators

Solo, Luke, McDermott, Matthew B. A., Parker, William F., Ramadan, Bashar, Burkhart, Michael C., Beaulieu-Jones, Brett K.

arXiv.org Machine Learning | Feb-4-2026

Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation of future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators, the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.
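
The general idea of reusing next-token probabilities that plain Monte Carlo throws away can be sketched on a toy problem. The example below is an illustration only and is not the paper's definition of SCOPE or REACH: the four-token Markov chain, its probabilities, and the names `NEXT`, `rollout`, `estimate`, and `exact_risk` are all assumptions made for this sketch. It compares the usual hit-or-miss average with an average of summed conditional outcome probabilities along the same sampled trajectories.

```python
import random

# Hypothetical toy "model": fixed next-token distributions over four tokens.
# "death" and "discharge" end a trajectory; all numbers are made up.
NEXT = {
    "stable":   {"stable": 0.70, "critical": 0.10, "discharge": 0.19, "death": 0.01},
    "critical": {"stable": 0.30, "critical": 0.50, "discharge": 0.05, "death": 0.15},
}
OUTCOME = "death"
HORIZON = 50  # maximum number of generated tokens per trajectory


def rollout(start, rng):
    """Sample one trajectory; return (outcome indicator, summed outcome probability)."""
    state, hit, prob_sum = start, 0.0, 0.0
    for _ in range(HORIZON):
        dist = NEXT[state]
        prob_sum += dist[OUTCOME]  # the probability plain Monte Carlo discards
        state = rng.choices(list(dist), weights=list(dist.values()))[0]
        if state == OUTCOME:
            hit = 1.0
        if state not in NEXT:  # a terminal token ends the trajectory
            break
    return hit, prob_sum


def estimate(start, n_samples, seed=0):
    """Return (plain Monte Carlo estimate, conditional-probability estimate)."""
    rng = random.Random(seed)
    results = [rollout(start, rng) for _ in range(n_samples)]
    mc = sum(hit for hit, _ in results) / n_samples
    cond = sum(prob for _, prob in results) / n_samples
    return mc, cond


def exact_risk(start):
    """Exact outcome probability within HORIZON steps, by dynamic programming."""
    risk = {s: 0.0 for s in NEXT}
    for _ in range(HORIZON):
        risk = {
            s: NEXT[s][OUTCOME]
            + sum(p * risk[t] for t, p in NEXT[s].items() if t in NEXT)
            for s in NEXT
        }
    return risk[start]


if __name__ == "__main__":
    print("exact:", exact_risk("stable"))
    print("estimates (mc, cond):", estimate("stable", 2000))
```

Both averages target the same quantity, because at every step the expected value of the sampled outcome indicator equals the outcome probability that the model assigned at that step. A single trajectory's summed probability can exceed 1, so only the average should be read as a risk estimate. Whether this toy variant reduces variance is not claimed here.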

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2602.0373

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

Lin, Zhiyu, Yang, Jingwen, Zhao, Jiale, Liu, Meng, Li, Sunzhu, Wang, Benyou

arXiv.org Artificial Intelligence | Oct-24-2025

Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
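
For readers unfamiliar with the SRCC figure quoted above, the following is a minimal sketch of how Spearman's rank correlation between an automatic score and human ratings can be computed. It is not taken from the DeEAR code base. The function names `average_ranks` and `spearman` and the sample numbers are assumptions for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: automatic scores versus human ratings for five clips.
metric_scores = [1, 2, 3, 4, 5]
human_ratings = [2, 1, 4, 3, 5]
rho = spearman(metric_scores, human_ratings)  # 0.8 for this made-up data
```

Because only ranks enter the calculation, the coefficient rewards a metric for ordering clips the way listeners do, regardless of the scale on which either score is expressed.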

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.20513

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Next Token Prediction Is a Dead End for Creativity

Olatunji, Ibukun, Sheppard, Mark

arXiv.org Artificial Intelligence | May-27-2025

This position paper argues that token prediction is fundamentally misaligned with real creativity. While next-token models have enabled impressive advances in language generation, their architecture favours surface-level coherence over spontaneity, originality, and improvisational risk. In contrast, creative acts, particularly in live performance domains, require dynamic responsiveness and stylistic divergence, enabling humans to transcend pre-learned patterns in the moment. We use battle rap as a case study to expose the limitations of predictive systems, demonstrating that they cannot truly engage in adversarial or emotionally resonant exchanges. As a result, such models fail to support the interactive flow states where human creators "lose themselves in the moment." Rather than pursuing greater predictive accuracy, we argue that AI research should embrace dialogue as a form of co-negotiated creative agency. This shift calls for approaches that prioritize real-time interaction, rhythmic alignment, and adaptive generative control. By reframing creativity as an interactive process rather than a predictive output, we offer a vision for AI systems that are more expressive, responsive, and aligned with human creative practice.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.19277

Country: Europe (0.46)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.93)
Media > Music (0.68)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.67)

MoonCast: High-Quality Zero-Shot Podcast Generation

Ju, Zeqian, Yang, Dongchao, Yu, Jianwei, Shen, Kai, Leng, Yichong, Wang, Zhengtao, Tan, Xu, Zhou, Xinyu, Qin, Tao, Li, Xiangyang

arXiv.org Artificial Intelligence | Mar-19-2025

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.14345

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Information Technology (0.93)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion

Hassan, Syed Zohaib, Lison, Pierre, Halvorsen, Pål

arXiv.org Artificial Intelligence | Dec-17-2024

Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, along with a slight reduction in intelligibility.
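
To make the notion of disfluency insertion concrete, here is a deliberately simple rule-based sketch. It is a stand-in for illustration, not the paper's method, which fine-tunes an LLM with LoRA. The filler inventory `FILLERS`, the function name `insert_fillers`, and the `rate` parameter are all assumptions.

```python
import random

FILLERS = ("um", "uh", "you know")  # hypothetical filler inventory


def insert_fillers(text, rate, seed=0):
    """Insert a random filler before each word with probability `rate`."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for word in text.split():
        if rng.random() < rate:
            out.append(rng.choice(FILLERS) + ",")
        out.append(word)
    return " ".join(out)


# rate=0.0 leaves the sentence unchanged; rate=1.0 precedes every word with a filler.
plain = insert_fillers("we tested the model", rate=0.0)
heavy = insert_fillers("we tested the model", rate=1.0)
```

A uniform random rule like this ignores where speakers actually hesitate, which is the kind of gap a learned approach is meant to close.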

large language model, machine learning, speaker 1, (22 more...)

arXiv.org Artificial Intelligence

2412.1271

Country:

Europe > Norway > Eastern Norway > Oslo (0.05)
North America > United States > Mississippi > Mississippi County > Mississippi State (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre:

Research Report > Experimental Study (0.94)
Questionnaire & Opinion Survey (0.90)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

AI Alignment and Totalitarianism

#artificialintelligence | Oct-15-2022, 02:55:13 GMT

This article looks at AI misalignment through the framework of totalitarianism, as laid out in Hannah Arendt's The Origins of Totalitarianism. I don't want to make any glib moral comparisons between the very real, singular horrors of totalitarianism in the 20th century and the still hypothetical problems of AI misalignment; but I believe the parallels are worth exploring nonetheless. In her magnum opus, Arendt describes a historical and political backdrop spawning a political movement fundamentally at odds with human flourishing, such a perverse break with previous forms of government as to constitute a humanity-destroying machine. Nick Bostrom's famous paperclip thought experiment imagines an AGI with a mandate to make as many paperclips as possible; carried out by an all-powerful agent, this banal but unconstrained (read totalitarian) reward function results in the apocalypse. Both are powerful machines that proceed logically and implacably, without the guidance of natural human intuition, towards a goal fundamentally at odds with human flourishing. A totalitarian government distinguishes itself from other authoritarian forms of government (even fascist dictatorships like Mussolini's Italy) in its perpetual movement towards dominating every aspect of life.

ai alignment and totalitarianism, government, spontaneity, (8 more...)

#artificialintelligence

Country: Europe > Italy (0.25)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.51)

A robotic ball with playful intentions. What if you would use this?

Robohub | Mar-27-2021, 10:45:24 GMT

When you step closer, the ball rolls away. When you try to catch him, he escapes. This is Fizzy, an autonomous, robotic ball that is programmed to play with children. He is ambiguous, does not like to be captured but does need attention. Little wheels inside the motor make sure that the movement is unconstrained and facilitate his playful character.

hospital, playful intention, robotic ball, (7 more...)

Robohub

Country: Europe > Netherlands > South Holland > Delft (0.06)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

New technique builds animal brain–like spontaneity into AI

#artificialintelligence | Nov-27-2020, 11:40:33 GMT

Internal motivations can prompt spontaneous changes in animal behavior. A recent study strives to design an artificial intelligence that can mimic animal-like actions using chaotic dynamics. A woman walking to a bus stop realizes that she forgot her keys; she suddenly turns around and runs home. Such spontaneous activities are hallmarks of animal behavior. Eager to capture the essence of the human brain, roboticists have tried to imitate these sorts of actions.

chaotic itinerancy, neural network, spontaneity, (15 more...)

#artificialintelligence

Country: