MusicGen
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need to cascade several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual descriptions or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen.
- Media > Music (0.67)
- Leisure & Entertainment (0.67)
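The "efficient token interleaving patterns" mentioned in the abstract can be illustrated with a minimal sketch of a delay-style pattern over k parallel codebook streams. The function names and the PAD sentinel below are hypothetical, and the real model operates on compressed audio tokens rather than toy integers; this only shows how delaying codebook i by i steps lets a single-stage LM predict all codebooks in one flattened pass:

```python
import numpy as np

PAD = -1  # hypothetical sentinel for positions with no valid token yet

def delay_interleave(tokens: np.ndarray) -> np.ndarray:
    """Apply a delay-style interleaving to a (k, T) grid of codebook tokens.

    Codebook i is shifted right by i steps, so at decoding step t the model
    predicts codebook 0 for frame t, codebook 1 for frame t-1, and so on.
    """
    k, T = tokens.shape
    out = np.full((k, T + k - 1), PAD, dtype=tokens.dtype)
    for i in range(k):
        out[i, i:i + T] = tokens[i]
    return out

def delay_deinterleave(grid: np.ndarray, k: int) -> np.ndarray:
    """Invert the delay pattern back to the time-aligned (k, T) grid."""
    T = grid.shape[1] - k + 1
    return np.stack([grid[i, i:i + T] for i in range(k)])
```

With two codebooks and four frames, the interleaved grid has one extra column, and undoing the shift recovers the original grid exactly.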
No Encore: Unlearning as Opt-Out in Music Generation
Kim, Jinju, Kim, Taehan, Waheed, Abdul, Hwan, Jong, Singh, Rita
AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems risk exploiting copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results from ongoing research on the first application of machine unlearning techniques to prevent inadvertent usage of creative content. In particular, we apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-training data without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future work on the application of unlearning to music generative models.
- Europe (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Law (1.00)
Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Mehta, Atharva, Chauhan, Shivam, Choudhury, Monojit
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt, while lacking stability in notes, rhythm alignment, and aesthetics. It is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training, are more efficient, and can produce better-quality output in comparison, but exhibit slightly higher redundancy in their generations.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (9 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
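The adapter-based PEFT setup the abstract compares can be sketched with the standard bottleneck-adapter computation: project the hidden state down to a small rank, apply a nonlinearity, project back up, and add the result residually. This is a generic illustration, not the paper's specific architecture; the function name and weight shapes are assumptions:

```python
import numpy as np

def adapter_forward(x: np.ndarray, W_down: np.ndarray, W_up: np.ndarray) -> np.ndarray:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.

    x:      (batch, d) hidden states from a frozen backbone layer
    W_down: (d, r) down-projection, r << d (the only trainable parameters,
            together with W_up)
    W_up:   (r, d) up-projection
    """
    h = np.maximum(0.0, x @ W_down)  # ReLU bottleneck activation
    return x + h @ W_up              # residual connection keeps the frozen path intact
```

Initializing `W_up` to zeros makes the adapter an identity function at the start of training, a common choice so fine-tuning begins from the pretrained model's behavior.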
EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing
Sioros, Vassilis, Potamianos, Alexandros, Paraskevopoulos, Giorgos
In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Hong Kong (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
Izzati, Fathinah, Li, Xinyue, Xia, Gus
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Asia > Middle East > UAE (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
LoopGen: Training-Free Loopable Music Generation
Marincione, Davide, Strano, Giorgio, Crisostomi, Donato, Ribuoli, Roberto, Rodolà, Emanuele
Loops--short audio segments designed for seamless repetition--are central to many music genres, particularly those rooted in dance and electronic styles. However, current generative music models struggle to produce truly loopable audio, as generating a short waveform alone does not guarantee a smooth transition from its endpoint back to its start, often resulting in audible discontinuities. We address this gap by modifying a non-autoregressive model (MAGNeT) to generate tokens in a circular pattern, letting the model attend to the beginning of the audio when creating its ending. This inference-only approach results in generations that are aware of future context and loop naturally, without the need for any additional training or data. We evaluate the consistency of loop transitions by computing token perplexity around the seam of the loop, observing a 55% improvement. Blind listening tests further confirm significant perceptual gains over baseline methods, improving mean ratings by 70%. Taken together, these results highlight the effectiveness of inference-only approaches in improving generative models and underscore the advantages of non-autoregressive methods for context-aware music generation.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
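The circular generation idea in the abstract, letting tokens near the end of the clip attend to tokens at the beginning, comes down to wrap-around indexing over the token sequence. The sketch below is a simplified illustration of that indexing, not the paper's MAGNeT modification; the function name and window scheme are assumptions:

```python
import numpy as np

def circular_context(tokens: np.ndarray, pos: int, window: int) -> np.ndarray:
    """Gather a symmetric context window around `pos` with wrap-around
    (modular) indexing, so positions near the end of the loop also see
    tokens from the beginning -- the property that makes the seam smooth."""
    n = len(tokens)
    offsets = [o for o in range(-window, window + 1) if o != 0]
    idx = [(pos + o) % n for o in offsets]  # modular arithmetic wraps past the ends
    return tokens[idx]
```

For the last position of an 8-token loop with a window of 2, the context contains the two preceding tokens and, via the wrap, the first two tokens of the sequence.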
Fine-Grained control over Music Generation with Activation Steering
Panda, Dipanshu, Joe, Jayden Koshy, R, Harshith M, Narashiman, Swathi, Mathur, Pranay, Veerakumar, Anish, Krishna, Aniruddh, A, Keerthiharan
We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using the weights of linear probes trained on it, or by steering the attention layer activations in a similar manner. We observe that modelling this as a regression task provides improved performance, hypothesizing that the mean-squared-error loss better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available on our demo page.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
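The core steering operation described above, nudging an activation along a direction learned by a linear probe, can be sketched in a few lines. This is a generic activation-steering sketch under assumed names, not the paper's exact procedure (which intervenes inside MusicGen's transformer at inference time):

```python
import numpy as np

def steer(activation: np.ndarray, probe_w: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation vector along the unit-normalized probe direction.

    activation: residual-stream (or attention) activation at some layer
    probe_w:    weight vector of a linear probe trained on that layer
    alpha:      steering strength; sign selects which side of the probe's
                decision boundary to move toward
    """
    direction = probe_w / np.linalg.norm(probe_w)  # normalize so alpha sets magnitude
    return activation + alpha * direction
```

In practice such a function would be applied via a hook on the chosen layer during generation, with `alpha` swept to trade off steering strength against audio quality.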
A Multimodal Symphony: Integrating Taste and Sound through Generative AI
Spanio, Matteo, Zampini, Massimiliano, Rodà, Antonio, Pierucci, Franco
Over recent years, the rapid evolution and progress of generative models have opened new possibilities in manipulating images, audio, and text, both independently and in a multimodal context. These AI advancements have ignited considerable debate about the essence of these human-engineered "intelligences". Critics have termed large language models (LLMs) "statistical parrots" (Bender et al., 2021) due to their reliance on data. However, others view them as advanced tools capable of emulating and exploring the intricate structures of the human brain (Zhao et al., 2023; Abbasiantaeb et al., 2024; Fayyaz et al., 2024). Despite this division, it has become increasingly clear that limiting these models to a few specialized areas greatly restricts their potential to fully grasp and portray the complexity of the world. Therefore, the integration of sensory modalities through technology, particularly using AI, has emerged as a compelling frontier in computer science and cognitive research (Murari et al., 2020; Turato et al., 2022). As multimodal AI models advance, they increasingly offer innovative solutions for bridging human experiences and machine understanding across diverse sensory domains. These models, which merge information from different modalities, enable machines to interpret complex real-world scenarios and provide more nuanced outputs. While recent research has predominantly focused on the intersection of audio and visual modalities, the potential for integrating taste and sound remains relatively unexplored.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy (0.04)
- Asia > Singapore (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.68)
- Health & Medicine > Therapeutic Area > Neurology (0.68)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.51)
Watermarking Training Data of Music Generation Models
Epple, Pascal, Shilov, Igor, Stevanoski, Bozhidar, de Montjoye, Yves-Alexandre
Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect unauthorized usage of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to those of a model trained on non-watermarked data. We study factors that impact the model's generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model's tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model's outputs. We also study the robustness of a state-of-the-art watermarking technique against removal techniques.
- Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)