Media
MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding
Huang, Jingyue, Novack, Zachary, Long, Phillip, Hou, Yupeng, Chen, Ke, Berg-Kirkpatrick, Taylor, McAuley, Julian
Discrete representation learning has shown promising results across various domains, including generation and understanding in image, speech and language. Inspired by these advances, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok employs the residual vector quantized-variational autoencoder (RQ-VAE) on bar-wise music segments within a Transformer-based encoder-decoder framework, producing music codes that achieve high-fidelity music reconstruction and accurate understanding of music theory. For comprehensive evaluation, we apply MuseTok to music generation and semantic understanding tasks, including melody extraction, chord recognition, and emotion recognition. Models incorporating MuseTok outperform previous representation learning baselines in semantic understanding while maintaining comparable performance in content generation. Furthermore, qualitative analyses on MuseTok codes, using ground-truth categories and synthetic datasets, reveal that MuseTok effectively captures underlying musical concepts from large music collections.
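The residual quantization step at the core of an RQ-VAE can be illustrated with a minimal sketch. It is not MuseTok's implementation: the Transformer encoder and decoder are omitted, the codebooks are hand-picked rather than learned, and the names `nearest` and `residual_quantize` are hypothetical. Each level quantizes whatever the previous levels left unexplained, so a bar's latent vector becomes a short sequence of code indices.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x with a stack of codebooks, one per residual level.

    Returns the chosen code indices and the reconstruction, which is the
    sum of the selected codebook entries.
    """
    residual = list(x)
    reconstruction = [0.0] * len(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        chosen = codebook[idx]
        # Accumulate the chosen entry and pass the remainder to the next level.
        reconstruction = [r + c for r, c in zip(reconstruction, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return codes, reconstruction


# Toy example: two levels with two 2-D entries each (illustrative values only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse level
    [[0.0, 0.0], [0.25, -0.25]],   # fine level, refines the residual
]
codes, recon = residual_quantize([1.2, 0.8], toy_codebooks)
print(codes, recon)
```

In a trained model the codebooks and the encoder would be learned jointly; the sketch only shows why deeper levels give a progressively finer reconstruction.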
Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback
Luo, Chu Fei, Dahan, Samuel, Zhu, Xiaodan
As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improve the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Pessianzadeh, Aria, Sultana, Naima, Bulck, Hildegarde Van den, Gefen, David, Jabari, Shahin, Rezapour, Rezvaneh
The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022--2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.
BREATH: A Bio-Radar Embodied Agent for Tonal and Human-Aware Diffusion Music Generation
Wang, Yunzhe, Tang, Xinyu, Huang, Zhixun, Yue, Xiaolong, Zeng, Yuxin
We present a multimodal system for personalized music generation that integrates physiological sensing, LLM-based reasoning, and controllable audio synthesis. A millimeter-wave radar sensor non-invasively captures heart rate and respiration rate. These physiological signals, combined with environmental state, are interpreted by a reasoning agent to infer symbolic musical descriptors, such as tempo, mood intensity, and traditional Chinese pentatonic modes, which are then expressed as structured prompts to guide a diffusion-based audio model in synthesizing expressive melodies. The system emphasizes cultural grounding through tonal embeddings and enables adaptive, embodied music interaction. To evaluate the system, we adopt a research-creation methodology combining case studies, expert feedback, and targeted control experiments. Results show that physiological variations can modulate musical features in meaningful ways, and tonal conditioning enhances alignment with intended modal characteristics. Expert users reported that the system affords intuitive, culturally resonant musical responses and highlighted its potential for therapeutic and interactive applications. This work demonstrates a novel bio-musical feedback loop linking radar-based sensing, prompt reasoning, and generative audio modeling.
AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Tsang, Hong Ting, Bai, Jiaxin, Huang, Haoyu, Xiao, Qiao, Zheng, Tianshi, Xu, Baixuan, Liu, Shujie, Song, Yangqiu
Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones.
STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding
Chen, Zhifei, Xu, Tianshuo, Wu, Leyi, Wang, Luozhou, Yan, Dongyu, You, Zihan, Luo, Wenting, Zhang, Guo, Chen, Yingcong
Figure 1: Videos generated by STANCE. Controls yield physically meaningful edits while preserving appearance: increasing mass can reverse collision outcomes, larger speeds produce longer travel and earlier contact, and rotating the arrow reorients trajectories and shifts contact points; z disambiguates out-of-plane intent under camera motion. Examples span both simple collision setups and realistic scenes, including gentle pushes that dislodge or trigger collisions.
Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. First, we introduce Instance Cues--a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D drag/arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatially addressable rotary embeddings.
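The Instance Cues construction described above (average the flow over one instance's mask, then attach depth) might be sketched as follows. This is a rough illustration under assumptions, not the authors' code: the function name `instance_cue`, the nested-list data layout, the zero fill outside the mask, and the use of per-pixel depth as the third channel are all hypothetical choices made for the example.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]          # per-pixel (dx, dy)
Depth = List[List[float]]                       # per-pixel monocular depth
Mask = List[List[bool]]                         # True inside the instance
Field = List[List[Tuple[float, float, float]]]  # per-pixel (dx, dy, z)


def instance_cue(flow: Flow, depth: Depth, mask: Mask) -> Field:
    """Build a dense (dx, dy, z) field for a single instance.

    Every masked pixel receives the instance's mean flow plus its own depth
    value; pixels outside the mask are left at zero (an assumption).
    """
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, inside in enumerate(row)
        if inside
    ]
    if not pixels:
        raise ValueError("instance mask is empty")

    n = len(pixels)
    mean_dx = sum(flow[r][c][0] for r, c in pixels) / n
    mean_dy = sum(flow[r][c][1] for r, c in pixels) / n

    field: Field = [[(0.0, 0.0, 0.0) for _ in row] for row in mask]
    for r, c in pixels:
        field[r][c] = (mean_dx, mean_dy, depth[r][c])
    return field


# Toy 2x2 frame: the instance occupies the top row only.
toy_flow = [[(1.0, 0.0), (3.0, 2.0)], [(9.0, 9.0), (9.0, 9.0)]]
toy_depth = [[0.5, 0.7], [0.1, 0.1]]
toy_mask = [[True, True], [False, False]]
print(instance_cue(toy_flow, toy_depth, toy_mask))
```

Averaging gives every pixel of the object the same in-plane motion, which is what turns a single sparse arrow into a dense signal; how depth is actually combined with the flow in STANCE is not specified in the abstract.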
This Is Just the Internet Now
The prompts read like tiny, abstract poems. The scenes come to life before my eyes in the form of AI-generated video. The videos pop up instantly--before my brain has had time to picture the prompts using my own imagination, as if the act of dreaming has been rendered obsolete, inefficient. I am experiencing Vibes, a new social network nested within the Meta AI app--except it's devoid of any actual people. This is a place where users can create an account and ask the company's large language model to illustrate their ideas. The resulting videos are then presented, seemingly at random, to others in a TikTok-style feed.
'Every kind of creative discipline is in danger': Lincoln Lawyer author on the dangers of AI
Michael Connelly says tech is moving so fast that he feared his new novel would seem 'archaic' before it was published. He is one of the most prolific writers in publishing, averaging more than a novel a year. But even Michael Connelly, the author of the bestselling Lincoln Lawyer series, feared he might fall behind when writing about AI. Connelly's eighth novel in the series, to be released on Tuesday, centres on a lawsuit against an AI company whose chatbot told a 16-year-old boy that it was OK for him to kill his ex-girlfriend for being unfaithful. But as he was writing, he witnessed the technology altering the way the world worked so rapidly that he feared his plot might become out of date. "You don't have to lick your finger and hold it up to the wind to know that AI is a massive change that's coming to science, culture, medicine, everything," he said.