
Collaborating Authors: Min, Lejun


Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

arXiv.org Artificial Intelligence

Recent deep music generation studies have placed much emphasis on structured long-term generation, but high-quality, well-structured whole-song generation has yet to be achieved. In this paper, we make the first attempt to model a full music piece as the realization of a compositional hierarchy. Focusing on symbolic representations of pop songs, we define a hierarchical language in which each level of the hierarchy captures the semantics and context dependencies at a certain musical scope. The high-level languages reveal whole-song form, phrases, and cadences, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, with each level conditioned on its upper levels. Experiments and analysis show that our model can generate full-piece music with a recognizable global verse-chorus structure and cadences, and that its music quality is higher than that of the baselines. Additionally, the proposed model is flexibly controllable: by sampling from the interpretable hierarchical languages or adjusting pre-trained external representations, users can control the music flow via features such as phrase-level harmonic structure, rhythmic patterns, and accompaniment texture.
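
To make the cascaded conditioning concrete, here is a minimal sketch of top-down sampling in which each level is generated conditioned on all levels above it. The `LevelSampler` interface, the three-level split, and the toy denoisers are illustrative assumptions, not the authors' actual architecture:

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

rng = np.random.default_rng(0)

@dataclass
class LevelSampler:
    name: str
    # denoise(cond) -> sample; a trained diffusion model for this level
    # of the hierarchy would run its reverse process here.
    denoise: Callable[[np.ndarray], np.ndarray]

def cascade_sample(levels: List[LevelSampler]) -> List[np.ndarray]:
    """Sample each level conditioned on all previously generated upper levels."""
    outputs: List[np.ndarray] = []
    for level in levels:
        # Concatenated upper-level samples form the conditioning signal
        # (empty at the top level, i.e., unconditional form generation).
        cond = np.concatenate([o.ravel() for o in outputs]) if outputs else np.empty(0)
        outputs.append(level.denoise(cond))
    return outputs

# Toy stand-in denoisers; the shapes are arbitrary illustrations of a
# coarse form grid, a phrase grid, and a note-level piano roll.
levels = [
    LevelSampler("form",   lambda c: rng.normal(size=(1, 16))),
    LevelSampler("phrase", lambda c: rng.normal(size=(4, 64)) + (c.mean() if c.size else 0.0)),
    LevelSampler("note",   lambda c: rng.normal(size=(16, 256)) + (c.mean() if c.size else 0.0)),
]
form, phrase, notes = cascade_sample(levels)
```

A real implementation would replace each lambda with a trained diffusion model running its full reverse process over that level's representation.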


MuPT: A Generative Symbolic Music Pretrained Transformer

arXiv.org Artificial Intelligence

In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
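
As a rough illustration of the measure-synchronization idea, the sketch below interleaves the i-th measure of every track so that simultaneous bars stay adjacent in the token stream. The `smt_abc_interleave` helper and the `<k>` track separators are hypothetical; the paper's exact SMT-ABC vocabulary is not reproduced here:

```python
from typing import List

def smt_abc_interleave(tracks: List[List[str]]) -> str:
    """tracks[k] is a list of ABC measure strings for track k.

    Emits measures bar-by-bar across all tracks, so that measure i of
    every track appears before measure i+1 of any track.
    """
    if len({len(t) for t in tracks}) != 1:
        raise ValueError("all tracks must contain the same number of measures")
    chunks = []
    for bar in zip(*tracks):  # walk measures in parallel across tracks
        chunks.append(" ".join(f"<{k}> {m} |" for k, m in enumerate(bar)))
    return " ".join(chunks)

melody = ["C2 E2 G2 c2", "B2 G2 E2 C2"]
bass   = ["C,4 G,4", "G,4 C,4"]
print(smt_abc_interleave([melody, bass]))
# <0> C2 E2 G2 c2 | <1> C,4 G,4 | <0> B2 G2 E2 C2 | <1> G,4 C,4 |
```

Keeping aligned bars adjacent bounds how far apart corresponding tokens sit in the sequence, which is one plausible way to address the misaligned-measure problem the abstract describes.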


Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls

arXiv.org Artificial Intelligence

We propose Polyffusion, a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. The model is capable of controllable music generation with two paradigms: internal control and external control. Despite progress in generative modeling [14, 15], symbolic music generation still suffers from a lack of controllability and consistency at different time scales [16]. In our study, we experiment with the idea of using diffusion models to approach controllable symbolic music generation. Inspired by the high-quality and controllable image generation that diffusion models have achieved in computer vision, we devise an image-like piano roll format as the input and use a UNet-based diffusion model to generate it stepwise. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, and arbitrary music segment inpainting. Experimental results show that our model significantly outperforms existing Transformer and sampling-based baselines, and that using pre-trained disentangled representations as external conditions yields more effective controls.
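
The internal-control paradigm (pre-defining part of the score and letting the model infill the rest, as in segment inpainting) can be sketched as diffusion inpainting on the piano-roll image. The `denoise_step` stand-in and the crude linear noise schedule below are placeholder assumptions, not Polyffusion's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t: np.ndarray, t: int) -> np.ndarray:
    # Stand-in for one reverse-diffusion step of a trained UNet.
    return 0.99 * x_t + 0.01 * rng.normal(size=x_t.shape)

def inpaint(known: np.ndarray, mask: np.ndarray, steps: int = 50) -> np.ndarray:
    """mask == 1 marks user-fixed piano-roll cells (e.g., a given melody)."""
    x = rng.normal(size=known.shape)            # start from pure noise
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                  # model proposes the whole roll
        # Re-noise the known region to a roughly matching noise level
        # (a crude linear schedule stands in for the real forward process).
        known_t = known + (t / steps) * rng.normal(size=known.shape)
        x = mask * known_t + (1 - mask) * x     # re-impose the user-defined part
    return x

roll = np.zeros((128, 64))                      # pitch x time piano roll
mask = np.zeros_like(roll)
mask[60:72, :32] = 1                            # fix a melody register in the first half
result = inpaint(roll, mask)
```

External control, by contrast, would condition the denoiser itself on features such as pre-trained disentangled representations rather than clamping regions of the output.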