spontaneity
Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment
Lin, Zhiyu, Yang, Jingwen, Zhao, Jiale, Liu, Meng, Li, Sunzhu, Wang, Benyou
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
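The SRCC figure quoted in the DeEAR abstract above is a rank correlation between model scores and human ratings. As a minimal sketch of how such a value can be computed (not the authors' evaluation code; the function names and the toy scores below are illustrative assumptions), Spearman's coefficient is the Pearson correlation of the two rank vectors, with tied values sharing their average rank:

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: model scores and human ratings for five utterances.
model_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_ratings = [2.0, 1.0, 4.0, 3.0, 5.0]
srcc = spearman_srcc(model_scores, human_ratings)
```

Because only ranks enter the computation, the coefficient is unchanged by any monotonic rescaling of either score, which is why it suits comparing an automatic score with human ratings on a different scale.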
Next Token Prediction Is a Dead End for Creativity
Olatunji, Ibukun, Sheppard, Mark
This position paper argues that token prediction is fundamentally misaligned with real creativity. While next-token models have enabled impressive advances in language generation, their architecture favours surface-level coherence over spontaneity, originality, and improvisational risk. In contrast, creative acts, particularly in live performance domains, require dynamic responsiveness and stylistic divergence, enabling humans to transcend pre-learned patterns in the moment. We use battle rap as a case study to expose the limitations of predictive systems, demonstrating that they cannot truly engage in adversarial or emotionally resonant exchanges. As a result, such models fail to support the interactive flow states where human creators "lose themselves in the moment." Rather than pursuing greater predictive accuracy, we argue that AI research should embrace dialogue as a form of co-negotiated creative agency. This shift calls for approaches that prioritize real-time interaction, rhythmic alignment, and adaptive generative control. By reframing creativity as an interactive process rather than a predictive output, we offer a vision for AI systems that are more expressive, responsive, and aligned with human creative practice.
- Europe > Spain > Galicia > Madrid (0.04)
- Europe > United Kingdom > England > Kent > Canterbury (0.04)
- Europe > Switzerland (0.04)
- Asia > Philippines (0.04)
- Leisure & Entertainment (0.93)
- Media > Music (0.68)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Generation (0.67)
MoonCast: High-Quality Zero-Shot Podcast Generation
Ju, Zeqian, Yang, Dongchao, Yu, Jianwei, Shen, Kai, Leng, Yichong, Wang, Zhengtao, Tan, Xu, Zhou, Xinyu, Qin, Tao, Li, Xiangyang
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion
Hassan, Syed Zohaib, Lison, Pierre, Halvorsen, Pål
Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, along with a slight reduction in intelligibility.
- Europe > Norway > Eastern Norway > Oslo (0.05)
- North America > United States > Mississippi > Mississippi County > Mississippi State (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > Experimental Study (0.94)
- Questionnaire & Opinion Survey (0.90)
AI Alignment and Totalitarianism
This article looks at AI misalignment through the framework of totalitarianism, as laid out in Hannah Arendt's The Origins of Totalitarianism. I don't want to make any glib moral comparisons between the very real, singular horrors of totalitarianism in the 20th century and the still hypothetical problems of AI misalignment; but I believe the parallels are worth exploring nonetheless. In her magnum opus, Arendt describes a historical and political backdrop spawning a political movement fundamentally at odds with human flourishing, such a perverse break with previous forms of government as to constitute a humanity-destroying machine. Nick Bostrom's famous paperclip thought experiment imagines an AGI with a mandate to make as many paperclips as possible; carried out by an all-powerful agent, this banal but unconstrained (read: totalitarian) reward function results in the apocalypse. Both are powerful machines that proceed logically and implacably, without the guidance of natural human intuition, towards a goal fundamentally at odds with human flourishing. A totalitarian government distinguishes itself from other authoritarian forms of government (even fascist dictatorships like Mussolini's Italy) in its perpetual movement towards dominating every aspect of life.
A robotic ball with playful intentions. What if you would use this?
When you step closer, the ball rolls away. When you try to catch him, he escapes. This is Fizzy, an autonomous, robotic ball that is programmed to play with children. He is ambiguous, does not like to be captured but does need attention. Little wheels inside the motor make sure that the movement is unconstrained and facilitate his playful character.
New technique builds animal brain–like spontaneity into AI
Internal motivations can prompt spontaneous changes in animal behavior. A recent study strives to design an artificial intelligence that can mimic animal-like actions using chaotic dynamics. A woman walking to a bus stop realizes that she forgot her keys; she suddenly turns around and runs home. Such spontaneous activities are hallmarks of animal behavior. Eager to capture the essence of the human brain, roboticists have tried to imitate these sorts of actions.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.07)
- North America (0.05)
- Europe > France > Île-de-France > Yvelines > Cergy-Pontoise (0.05)
- Europe > France > Île-de-France > Val-d'Oise > Cergy-Pontoise (0.05)
Researchers use dynamical systems and machine learning to add spontaneity to AI
Autonomous functions for robots, such as spontaneity, are highly sought after. Many control mechanisms for autonomous robots are inspired by the functions of animals, including humans. Roboticists often design robot behaviors using predefined modules and control methodologies, which makes them task-specific, limiting their flexibility. Researchers offer an alternative machine learning-based method for designing spontaneous behaviors by capitalizing on complex temporal patterns, like neural activities of animal brains. They hope to see their design implemented in robotic platforms to improve their autonomous capabilities.
Could your TOASTER help you find love? Smart home devices may match you to the perfect person by 2026
While online dating was once seen as a last resort for meeting a partner, one in five relationships now starts online. And it appears that the future may extend the ways to use technology to find love even further. A new study suggests that by 2026, smart home devices, including your toaster and wardrobe, could help you find love. As for smart wardrobes, eHarmony's research indicates clothing style is actually an extremely precise and detailed reflection of a person's personality. Smart appliances such as fridges, toasters, coffee makers and cooking devices could reveal a large amount of information about our diet, meal times and even spontaneity when choosing or preparing food.