MusicGen
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need to cascade several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual descriptions or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen.
- Media > Music (0.67)
- Leisure & Entertainment (0.67)
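The "efficient token interleaving patterns" mentioned in the abstract can be illustrated with a minimal sketch of a delay-style pattern over k parallel codebook streams. The function names and the PAD sentinel below are hypothetical, and the real model operates on compressed audio tokens rather than toy integers; this only shows how delaying codebook i by i steps lets a single-stage LM predict all codebooks in one flattened pass:

```python
import numpy as np

PAD = -1  # hypothetical sentinel for positions with no valid token yet

def delay_interleave(tokens: np.ndarray) -> np.ndarray:
    """Apply a delay-style interleaving to a (k, T) grid of codebook tokens.

    Codebook i is shifted right by i steps, so at decoding step t the model
    predicts codebook 0 for frame t, codebook 1 for frame t-1, and so on.
    """
    k, T = tokens.shape
    out = np.full((k, T + k - 1), PAD, dtype=tokens.dtype)
    for i in range(k):
        out[i, i:i + T] = tokens[i]
    return out

def delay_deinterleave(grid: np.ndarray, k: int) -> np.ndarray:
    """Invert the delay pattern back to the time-aligned (k, T) grid."""
    T = grid.shape[1] - k + 1
    return np.stack([grid[i, i:i + T] for i in range(k)])
```

With two codebooks and four frames, the interleaved grid has one extra column, and undoing the shift recovers the original grid exactly.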
No Encore: Unlearning as Opt-Out in Music Generation
Kim, Jinju, Kim, Taehan, Waheed, Abdul, Hwan, Jong, Singh, Rita
AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems risk exploiting copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results from ongoing research on the first application of machine unlearning techniques to prevent inadvertent usage of creative content. In particular, we apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-training data without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future work on the application of unlearning to music generative models.
- Europe (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Law (1.00)
Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Mehta, Atharva, Chauhan, Shivam, Choudhury, Monojit
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt, while lacking stability in notes, rhythm alignment, and aesthetics. It is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training, are more efficient, and can produce better-quality output in comparison, but exhibit slightly higher redundancy in their generations.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (9 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
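The adapter-based PEFT setup the abstract compares can be sketched with the standard bottleneck-adapter computation: project the hidden state down to a small rank, apply a nonlinearity, project back up, and add the result residually. This is a generic illustration, not the paper's specific architecture; the function name and weight shapes are assumptions:

```python
import numpy as np

def adapter_forward(x: np.ndarray, W_down: np.ndarray, W_up: np.ndarray) -> np.ndarray:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.

    x:      (batch, d) hidden states from a frozen backbone layer
    W_down: (d, r) down-projection, r << d (the only trainable parameters,
            together with W_up)
    W_up:   (r, d) up-projection
    """
    h = np.maximum(0.0, x @ W_down)  # ReLU bottleneck activation
    return x + h @ W_up              # residual connection keeps the frozen path intact
```

Initializing `W_up` to zeros makes the adapter an identity function at the start of training, a common choice so fine-tuning begins from the pretrained model's behavior.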
EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing
Sioros, Vassilis, Potamianos, Alexandros, Paraskevopoulos, Giorgos
In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Hong Kong (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
Izzati, Fathinah, Li, Xinyue, Xia, Gus
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Asia > Middle East > UAE (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
LoopGen: Training-Free Loopable Music Generation
Marincione, Davide, Strano, Giorgio, Crisostomi, Donato, Ribuoli, Roberto, Rodolà, Emanuele
Loops--short audio segments designed for seamless repetition--are central to many music genres, particularly those rooted in dance and electronic styles. However, current generative music models struggle to produce truly loopable audio, as generating a short waveform alone does not guarantee a smooth transition from its endpoint back to its start, often resulting in audible discontinuities. We address this gap by modifying a non-autoregressive model (MAGNeT) to generate tokens in a circular pattern, letting the model attend to the beginning of the audio when creating its ending. This inference-only approach results in generations that are aware of future context and loop naturally, without the need for any additional training or data. We evaluate the consistency of loop transitions by computing token perplexity around the seam of the loop, observing a 55% improvement. Blind listening tests further confirm significant perceptual gains over baseline methods, improving mean ratings by 70%. Taken together, these results highlight the effectiveness of inference-only approaches in improving generative models and underscore the advantages of non-autoregressive methods for context-aware music generation.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
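The circular generation idea in the abstract, letting tokens near the end of the clip attend to tokens at the beginning, comes down to wrap-around indexing over the token sequence. The sketch below is a simplified illustration of that indexing, not the paper's MAGNeT modification; the function name and window scheme are assumptions:

```python
import numpy as np

def circular_context(tokens: np.ndarray, pos: int, window: int) -> np.ndarray:
    """Gather a symmetric context window around `pos` with wrap-around
    (modular) indexing, so positions near the end of the loop also see
    tokens from the beginning -- the property that makes the seam smooth."""
    n = len(tokens)
    offsets = [o for o in range(-window, window + 1) if o != 0]
    idx = [(pos + o) % n for o in offsets]  # modular arithmetic wraps past the ends
    return tokens[idx]
```

For the last position of an 8-token loop with a window of 2, the context contains the two preceding tokens and, via the wrap, the first two tokens of the sequence.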
Fine-Grained control over Music Generation with Activation Steering
Panda, Dipanshu, Joe, Jayden Koshy, R, Harshith M, Narashiman, Swathi, Mathur, Pranay, Veerakumar, Anish, Krishna, Aniruddh, A, Keerthiharan
We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using the weights of linear probes trained on it, or by steering the attention layer activations in a similar manner. We observe that modelling this as a regression task provides improved performance, hypothesizing that the mean-squared-error loss better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available on our demo page.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
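The core steering operation described above, nudging an activation along a direction learned by a linear probe, can be sketched in a few lines. This is a generic activation-steering sketch under assumed names, not the paper's exact procedure (which intervenes inside MusicGen's transformer at inference time):

```python
import numpy as np

def steer(activation: np.ndarray, probe_w: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation vector along the unit-normalized probe direction.

    activation: residual-stream (or attention) activation at some layer
    probe_w:    weight vector of a linear probe trained on that layer
    alpha:      steering strength; sign selects which side of the probe's
                decision boundary to move toward
    """
    direction = probe_w / np.linalg.norm(probe_w)  # normalize so alpha sets magnitude
    return activation + alpha * direction
```

In practice such a function would be applied via a hook on the chosen layer during generation, with `alpha` swept to trade off steering strength against audio quality.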
A Multimodal Symphony: Integrating Taste and Sound through Generative AI
Spanio, Matteo, Zampini, Massimiliano, Rodà, Antonio, Pierucci, Franco
Over recent years, the rapid evolution and progress of generative models have opened new possibilities in manipulating images, audio, and text, both independently and in a multimodal context. These AI advancements have ignited considerable debate about the essence of these human-engineered "intelligences". Critics have termed large language models (LLMs) "statistical parrots" (Bender et al., 2021) due to their reliance on data. However, others view them as advanced tools capable of emulating and exploring the intricate structures of the human brain (Zhao et al., 2023; Abbasiantaeb et al., 2024; Fayyaz et al., 2024). Despite this division, it has become increasingly clear that limiting these models to a few specialized areas greatly restricts their potential to fully grasp and portray the complexity of the world. Therefore, the integration of sensory modalities through technology, particularly using AI, has emerged as a compelling frontier in computer science and cognitive research (Murari et al., 2020; Turato et al., 2022). As multimodal AI models advance, they increasingly offer innovative solutions for bridging human experiences and machine understanding across diverse sensory domains. These models, which merge information from different modalities, enable machines to interpret complex real-world scenarios and provide more nuanced outputs. While recent research has predominantly focused on the intersection of audio and visual modalities, the potential for integrating taste and sound remains relatively unexplored.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy (0.04)
- Asia > Singapore (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.68)
- Health & Medicine > Therapeutic Area > Neurology (0.68)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.51)
Watermarking Training Data of Music Generation Models
Epple, Pascal, Shilov, Igor, Stevanoski, Bozhidar, de Montjoye, Yves-Alexandre
Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect unauthorized usage of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to those of a model trained on non-watermarked data. We study factors that impact the model's generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model's tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model's outputs. We also study the robustness of a state-of-the-art watermarking technique against removal techniques.
- Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)