MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners
Tsai, Fang-Duo, Wu, Shih-Lun, Lee, Weijaw, Yang, Sheng-Ping, Chen, Bo-Rui, Cheng, Hao-Chung, Yang, Yi-Hsuan
arXiv.org Artificial Intelligence
We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have seldom been used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://musecontrollite.github.io/web/.
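To make the central idea of the abstract concrete, the following is a minimal NumPy sketch of rotary positional embeddings applied inside a decoupled cross-attention branch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rope`, `decoupled_cross_attention`), the weight dictionary keys, the `scale` parameter, the single attention head, and the assumption that latent frames and condition frames share one time grid are all hypothetical choices made for this example.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, dim) by angles proportional to position."""
    seq, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # pair feature i with i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention for a single head."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoupled_cross_attention(h, text, cond, w, scale=1.0):
    """Text branch without positions plus a separate, RoPE-equipped condition branch.

    h:    (frames, d_model) latent audio frames
    text: (tokens, d_model) text embeddings
    cond: (frames_c, d_cond) time-varying condition, e.g. a melody feature
    w:    dict of hypothetical projection matrices
    """
    q = h @ w["q"]
    # Text condition: no positional embedding on this branch.
    text_out = attention(q, text @ w["k_text"], text @ w["v_text"])
    # Time-varying condition: rotate queries and keys by their frame index
    # (assumes both sequences live on the same time grid).
    q_pos = np.arange(h.shape[0], dtype=float)
    c_pos = np.arange(cond.shape[0], dtype=float)
    cond_out = attention(rope(q, q_pos),
                         rope(cond @ w["k_cond"], c_pos),
                         cond @ w["v_cond"])
    return text_out + scale * cond_out
```

Because the rotation makes each query-key score depend only on the difference between frame indices, the condition branch can tell which condition frame lines up with which latent frame, which is the property the abstract identifies as critical for time-varying controls.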
Jun-25-2025
- Country:
  - North America
    - Canada (0.04)
    - United States > Massachusetts
      - Middlesex County > Cambridge (0.04)
  - Asia > Taiwan
    - Taiwan Province > Taipei (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Media > Music (1.00)
  - Leisure & Entertainment (1.00)
- Technology: