Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling
Louis Bradshaw, Simon Colton
arXiv.org Artificial Intelligence
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide.

Central to the success of deep learning as a paradigm have been the datasets used to train neural networks. With rapid technical advancements and the ever-increasing availability of computational power, music has become a popular target for deep learning research, and deep learning in turn has had a notable impact on the study and creation of musical works (Briot et al., 2019). The progress of music-oriented deep learning depends heavily on access to diverse, well-structured datasets. Music is inherently structured and can be represented computationally in a variety of forms (Wiggins, 2016). In this work, we focus on symbolic representations of music, such as MIDI (Musical Instrument Digital Interface), which are widely used for encoding, analyzing, and facilitating the generation of musical compositions by both humans and machines (Ji et al., 2023).
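The multi-stage pipeline described above (metadata scoring, pruning with an audio classifier, then transcription) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the language-model scorer and audio classifier are stubbed with hypothetical heuristics, and all function names (`score_metadata`, `is_solo_piano`, `transcribe`, `run_pipeline`) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    metadata: str    # free-text title/description scraped alongside the audio
    audio_path: str  # local path of the downloaded audio file


def score_metadata(rec: Recording) -> float:
    """Stage 1: score a candidate from its metadata.

    The paper uses a language model here; this keyword heuristic is a
    hypothetical stand-in so the sketch runs end to end.
    """
    keywords = ("piano", "sonata", "nocturne", "etude", "recital")
    hits = sum(kw in rec.metadata.lower() for kw in keywords)
    return hits / len(keywords)


def is_solo_piano(rec: Recording) -> bool:
    """Stage 2: prune non-piano recordings.

    The paper uses an audio classifier on the waveform; this stub simply
    reuses the metadata score as a placeholder decision.
    """
    return score_metadata(rec) > 0.0


def transcribe(rec: Recording) -> str:
    """Stage 3: transcribe audio to MIDI (stubbed as a filename mapping)."""
    return rec.audio_path.rsplit(".", 1)[0] + ".mid"


def run_pipeline(candidates: list[Recording], threshold: float = 0.2) -> list[str]:
    """Chain the three stages: score, prune, transcribe."""
    kept = [r for r in candidates if score_metadata(r) >= threshold]
    kept = [r for r in kept if is_solo_piano(r)]
    return [transcribe(r) for r in kept]
```

In the real system each stage would filter at scale (crawled web audio in, segmented solo-piano MIDI out); the sketch only fixes the control flow, with every model swapped for a trivial placeholder.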
Jul-1-2025