accompaniment
- North America > United States (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Vision (0.67)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Information Technology (0.67)
Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
Wu, Yusong, Brade, Stephen, Ma, Teng, Fowler, Tia-Jane, Yang, Enning, Banar, Berker, Courville, Aaron, Jaques, Natasha, Huang, Cheng-Zhi Anna
Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
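As a rough illustration of the training signal described in the abstract, the sketch below combines a coherence reward with a co-evolving discriminator bonus in a REINFORCE-style update. Every name here (`policy.sample`, `coherence_reward`, the discriminator architecture) is a placeholder assumption, not the authors' implementation.

```python
# Minimal sketch (assumption): coherence reward + discriminator bonus for RL
# post-training. `policy`, `coherence_reward`, and the trajectory featurization
# are placeholders, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj):                # traj: (batch, dim) trajectory features
        return self.net(traj).squeeze(-1)   # logit: higher = "looks like real data"

def adversarial_rl_step(policy, disc, opt_pi, opt_d, data_traj, coherence_reward):
    traj, log_prob = policy.sample()        # on-policy trajectories (placeholder API)

    # Discriminator update: separate data trajectories from policy trajectories.
    d_loss = F.binary_cross_entropy_with_logits(disc(data_traj),
                                                torch.ones(len(data_traj))) + \
             F.binary_cross_entropy_with_logits(disc(traj.detach()),
                                                torch.zeros(len(traj)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Policy update: maximize coherence reward plus discriminator bonus, which
    # penalizes collapse to trivial outputs that merely exploit the coherence reward.
    reward = coherence_reward(traj) + torch.sigmoid(disc(traj)).log()
    pi_loss = -(log_prob * reward.detach()).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```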
- Asia > India (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
Lin, Rui, Wu, Zhiyue, Le, Jiahe, Wang, Kangdi, Chen, Weixiong, Dai, Junyu, Jiang, Tao
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
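The dual-codebook idea with hard routing can be pictured with a toy quantizer like the one below. This is a schematic assumption for illustration only; it omits SimVQ, the straight-through estimator, and the BEST-RQ encoder, and the class and argument names are invented.

```python
# Toy sketch (assumption): hard routing of encoder frames to one of two
# codebooks, one for vocals and one for accompaniment. Not the paper's SimVQ
# implementation; no straight-through gradient is included.
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.vocal = nn.Embedding(codebook_size, dim)    # vocal-track codebook
        self.accomp = nn.Embedding(codebook_size, dim)   # accompaniment codebook

    def forward(self, z, is_vocal):
        # z: (frames, dim) encoder features; is_vocal: (frames,) bool hard route
        book = torch.where(is_vocal[:, None, None],
                           self.vocal.weight[None], self.accomp.weight[None])
        dist = torch.cdist(z[:, None, :], book).squeeze(1)   # (frames, codebook_size)
        idx = dist.argmin(dim=-1)                            # nearest code per frame
        quantized = torch.where(is_vocal[:, None], self.vocal(idx), self.accomp(idx))
        return idx, quantized
```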
- Media > Music (0.94)
- Leisure & Entertainment (0.94)
- North America > United States (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
Exploring Situated Stabilities of a Rhythm Generation System through Variational Cross-Examination
Kotowski, Błażej, Evans, Nicholas, Haki, Behzad, Font, Frederic, Jordà, Sergi
This paper investigates GrooveTransformer, a real-time rhythm generation system, through the postphenomenological framework of Variational Cross-Examination (VCE). By reflecting on its deployment across three distinct artistic contexts, we identify three stabilities: an autonomous drum accompaniment generator, a rhythmic control voltage sequencer in Eurorack format, and a rhythm driver for a harmonic accompaniment system. The versatility of its applications was not an explicit goal from the outset of the project. Thus, we ask: how did this multistability emerge? Through VCE, we identify three key contributors to its emergence: the affordances of system invariants, the interdisciplinary collaboration, and the situated nature of its development. We conclude by reflecting on the viability of VCE as a descriptive and analytical method for Digital Musical Instrument (DMI) design, emphasizing its value in uncovering how technologies mediate, co-shape, and are co-shaped by users and contexts.
- Oceania > Palau (0.05)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- North America > United States > Pennsylvania > Centre County > University Park (0.04)
- (3 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Versatile Framework for Song Generation with Prompt-based Control
Zhang, Yu, Guo, Wenxiang, Pan, Changhao, Zhu, Zhiyuan, Li, Ruiqi, Lu, Jingyu, Huang, Rongjie, Zhang, Ruiyuan, Hong, Zhiqing, Jiang, Ziyue, Zhao, Zhou
Song generation focuses on producing controllable, high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment, and they fall short in supporting diverse tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises four primary models: 1) VocalBand, a decoupled model, leverages flow matching to generate singing styles, pitches, and mel-spectrograms, enabling fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates Band-MOE, which selects suitable experts for enhanced quality, alignment, and control, enabling generation of controllable, high-quality accompaniments aligned with the vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, round out the comprehensive multi-task song generation system, allowing extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks on objective and subjective metrics. Demos and code are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
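For readers unfamiliar with the mixture-of-experts pattern mentioned for AccompBand, the sketch below shows a generic top-k gated MoE layer. It is a textbook-style assumption, not the Band-MOE design; the expert roles and routing rules used in VersBand are not reproduced here.

```python
# Generic top-k mixture-of-experts layer (assumption): illustrates gated expert
# selection in general, not the specific Band-MOE used by AccompBand.
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, dim=512, n_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)     # gate probability per expert
        topw, topi = scores.topk(self.top_k, dim=-1)
        sparse = torch.zeros_like(scores).scatter(-1, topi, topw)
        sparse = sparse / sparse.sum(dim=-1, keepdim=True)   # renormalize top-k
        # For clarity every expert runs on every token; real MoE layers dispatch
        # each token only to its selected experts for efficiency.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return (sparse[..., None] * expert_out).sum(dim=1)
```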
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- Asia > China (0.04)
- North America > United States (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike
We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
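One simple way to read "track-decoupled next-token prediction" is that vocal and accompaniment token streams are kept separate but modeled jointly by a single autoregressive LM. The toy interleaving below is an assumed illustration of that idea, not YuE's actual token layout.

```python
# Illustrative sketch (assumption): interleave aligned vocal and accompaniment
# token streams so a single autoregressive LM predicts both tracks instead of a
# dense mixture. The layout is a guess, not YuE's actual scheme.
def interleave_tracks(vocal_tokens, accomp_tokens):
    """Frame-wise interleaving: v0 a0 v1 a1 ..."""
    assert len(vocal_tokens) == len(accomp_tokens)
    seq = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        seq.extend([v, a])
    return seq

# The LM is then trained with ordinary next-token prediction over `seq`, so each
# accompaniment token is predicted with the current vocal token already in context.
print(interleave_tracks([1, 2, 3], [10, 20, 30]))  # [1, 10, 2, 20, 3, 30]
```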
- Asia > China (0.27)
- Asia > Japan (0.24)
- North America > United States (0.14)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.48)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Liu, Zihan, Ding, Shuangrui, Zhang, Zhixiong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/, and the code will be available at https://github.com/LiuZH-19/SongGen.
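The difference between the two output modes can be illustrated with the toy layouts below. These are assumptions for exposition only; the paper explores several concrete token pattern strategies per mode, and its actual choices may differ.

```python
# Sketch (assumption): toy token layouts for the two SongGen output modes.
# The real token pattern strategies are explored in the paper and may differ.
def mixed_mode(mixture_tokens):
    # a single stream of codec tokens for the already-mixed audio
    return list(mixture_tokens)

def dual_track_mode(vocal_tokens, accomp_tokens):
    # vocals and accompaniment kept as two aligned streams; pairing them frame by
    # frame lets one autoregressive step emit one (vocal, accompaniment) pair
    assert len(vocal_tokens) == len(accomp_tokens)
    return list(zip(vocal_tokens, accomp_tokens))

print(mixed_mode([5, 6, 7]))              # [5, 6, 7]
print(dual_track_mode([1, 2], [10, 20]))  # [(1, 10), (2, 20)]
```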
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Research Report (0.51)
- Workflow (0.47)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology (0.66)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)