MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Ali Boudaghi, Hadi Zare

arXiv.org Artificial Intelligence 

Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations: they are restricted to editing synthesized music generated by their own models, require highly precise prompts, or necessitate task-specific retraining, and thus lack true zero-shot capability. To address these limitations, we introduce MusRec, a zero-shot text-to-music editing approach built on rectified flow and diffusion transformers. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

The landscape of audio generation has shifted dramatically in recent years. Text-to-music systems now allow users to compose entire musical pieces from simple textual descriptions, powered by advances in diffusion models and transformer architectures [1]-[11]. While impressive, these systems are still primarily designed for creation from scratch. In contrast, real-world music practice often revolves around editing: refining a performance, altering instrumentation, or adapting an existing recording into a new style. For musicians, producers, and casual creators alike, the ability to reshape existing audio is often more valuable than generating entirely new material.

Music editing, however, is fundamentally more difficult than generation. It requires the model to balance two competing goals: applying the requested modification faithfully, and preserving the rich details of the input recording that should remain unchanged. This trade-off is especially challenging for expressive, polyphonic, or multi-instrumental recordings. Existing research has attempted to address editing through supervised datasets of paired "before" and "after" examples [12]-[14], or through zero-shot latent manipulations in diffusion models [15]-[17]. Yet most methods remain limited to specific editing tasks, operate mainly on model-generated music rather than arbitrary recordings, and often require very precise prompts to succeed [15], [17].
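To make the zero-shot latent-manipulation idea concrete in the rectified-flow setting the title names, the sketch below shows one standard recipe: invert a real recording's latent to noise along the flow ODE under a source prompt, then integrate back under the target prompt. This is a minimal, hypothetical illustration, not the authors' implementation; `velocity_model` (a text-conditioned DiT predicting the velocity field), the prompt embeddings, the Euler discretization, and the convention that t=0 is data and t=1 is noise are all assumptions made for the example.

```python
import torch

@torch.no_grad()
def rectified_flow_edit(x_latent, src_prompt_emb, tgt_prompt_emb,
                        velocity_model, num_steps=50):
    """Zero-shot editing sketch (hypothetical, not the paper's code).

    velocity_model(x, t, cond) is assumed to be a text-conditioned DiT
    that predicts the rectified-flow velocity v_theta(x_t, t).
    Convention assumed here: t=0 is clean data, t=1 is Gaussian noise.
    """
    dt = 1.0 / num_steps
    x = x_latent

    # Inversion: Euler-integrate the ODE from data (t=0) to noise (t=1)
    # under the *source* prompt, recovering a noise latent for the input.
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + velocity_model(x, t, src_prompt_emb) * dt

    # Editing: integrate back from noise (t=1) to data (t=0) under the
    # *target* prompt; starting from the inverted noise (rather than a
    # fresh sample) is what preserves the input's musical structure.
    for i in reversed(range(num_steps)):
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        x = x - velocity_model(x, t, tgt_prompt_emb) * dt
    return x
```

The design point this illustrates is the editing trade-off discussed above: the inverted noise anchors the output to the original recording's content, while the change of conditioning from source to target prompt drives the requested modification, with no paired training data or task-specific retraining.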