midi
Audeo: Audio Generation for a Silent Performance Video
In the last step, we implement MIDI synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly, with only a few setup constraints. We evaluate Audeo on piano performance videos collected from YouTube and find that the generated music is of reasonable audio quality and can be successfully recognized with high precision by popular music identification software. The source code with examples is available in a GitHub repository.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Ireland (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling
Chu, Hung-Ying, Wei, Shao-Yu, Chen, Guan-Wei, Hung, Tzu-Wei, Tsai, ChengYang, Lin, Yu-Cheng
Recent advances in large language models (LLMs) have created new opportunities for symbolic music generation. However, existing formats such as MIDI, ABC, and MusicXML are either overly complex or structurally inconsistent, limiting their suitability for token-based learning architectures. To address these challenges, we propose HNote, a novel hexadecimal-based notation system extended from YNote, which encodes both pitch and duration within a fixed 32-unit measure framework. This design ensures alignment, reduces ambiguity, and is directly compatible with LLM architectures. We converted 12,300 Jiangnan-style songs, generated from traditional folk pieces, from YNote into HNote, and fine-tuned LLaMA-3.1 (8B) using parameter-efficient LoRA. Experimental results show that HNote achieves a syntactic correctness rate of 82.5%, and BLEU and ROUGE evaluations demonstrate strong symbolic and structural similarity, producing stylistically coherent compositions. This study establishes HNote as an effective framework for integrating LLMs with cultural music modeling.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Ireland (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Speech (0.94)
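The fixed-width, measure-aligned encoding idea behind HNote can be illustrated with a minimal sketch. The exact token layout below (two hex digits each for MIDI pitch, onset slot, and duration) is a hypothetical illustration, not the published HNote specification.

```python
def encode_measure(notes, units_per_measure=32):
    """Encode (midi_pitch, onset_unit, duration_units) triples as hex tokens.

    Each note becomes a fixed-width token: 2 hex digits of MIDI pitch,
    2 hex digits of onset (0-31), and 2 hex digits of duration (1-32).
    Fixed width keeps the token stream aligned for an LLM tokenizer.
    """
    tokens = []
    for pitch, onset, dur in notes:
        assert 0 <= onset < units_per_measure
        assert 1 <= dur <= units_per_measure
        tokens.append(f"{pitch:02X}{onset:02X}{dur:02X}")
    return " ".join(tokens)

def decode_measure(text):
    """Invert encode_measure back to (pitch, onset, duration) triples."""
    return [
        (int(tok[0:2], 16), int(tok[2:4], 16), int(tok[4:6], 16))
        for tok in text.split()
    ]

# Middle C (MIDI 60) on beat 1 for 8/32 of the measure, then E (64) on beat 2.
measure = encode_measure([(60, 0, 8), (64, 8, 8)])  # "3C0008 400808"
```

Because every token has the same width and every measure spans the same 32 units, a decoder can always recover note boundaries without a grammar, which is the alignment property the abstract emphasizes.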
PianoVAM: A Multimodal Piano Performance Dataset
Kim, Yonghyun, Park, Junhyung, Bae, Joonhyung, Kim, Kirak, Kwon, Taegyun, Lerch, Alexander, Nam, Juhan
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
- North America > United States (0.14)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.68)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Sulun, Serkan, Viana, Paula, Davies, Matthew E. P.
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nebraska (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Bang, Hayeon, Choi, Eunjin, Finch, Megan, Doh, Seungheon, Lee, Seolhee, Lee, Gyeong-Hoon, Nam, Juhan
While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multi-modal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Wu, Shangda, Wang, Yashan, Yuan, Ruibin, Guo, Zhancheng, Tan, Xu, Zhang, Ge, Zhou, Monan, Chen, Jing, Mu, Xuefeng, Gao, Yuejie, Dong, Yuanliang, Liu, Jiafeng, Li, Xiaobing, Yu, Feng, Sun, Maosong
Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Rhode Island (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (14 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Education > Curriculum > Subject-Specific Education (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
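The contrastive alignment between the text and music encoders can be illustrated with a minimal NumPy sketch of a symmetric CLIP-style InfoNCE objective. The temperature value and the use of plain NumPy are illustrative assumptions; CLaMP 2's actual loss and hyperparameters may differ.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/music embeddings.

    Row i of text_emb and row i of music_emb describe the same piece,
    so matched pairs sit on the diagonal of the similarity matrix.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature          # (B, B) cosine similarities
    diag = np.arange(len(logits))
    loss_t2m = -log_softmax(logits)[diag, diag].mean()   # text -> music
    loss_m2t = -log_softmax(logits.T)[diag, diag].mean() # music -> text
    return (loss_t2m + loss_m2t) / 2
```

Minimizing this loss pulls each text embedding toward its paired music embedding and pushes it away from the other pieces in the batch, which is what lets a single query embedding serve cross-modal retrieval.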
Toward Fully Self-Supervised Multi-Pitch Estimation
Cwitkowitz, Frank, Duan, Zhiyao
Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- North America > United States > New York > Monroe County > Rochester (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
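One of the abstract's self-supervised objectives, equivariance to geometric transformations, can be sketched as a transposition-equivariance penalty: rolling the input spectrogram along its log-frequency axis should roll the predicted salience map by the same number of bins. The loss form and the use of a circular roll are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def equivariance_loss(model, spec, shift_bins):
    """Transposition-equivariance penalty (sketch).

    `model` maps a (freq, time) spectrogram to a (freq, time) salience
    map. If the input is shifted along the frequency axis, the output
    should shift identically; the mean squared deviation measures how
    far the model is from that property.
    """
    out_of_shifted = model(np.roll(spec, shift_bins, axis=0))
    shifted_out = np.roll(model(spec), shift_bins, axis=0)
    return np.mean((out_of_shifted - shifted_out) ** 2)

# The identity "model" is perfectly shift-equivariant:
spec = np.random.default_rng(0).random((84, 10))
equivariance_loss(lambda x: x, spec, 5)  # 0.0
```

A model that applies frequency-dependent weights (e.g. `lambda x: x * np.arange(84)[:, None]`) breaks this symmetry and incurs a positive loss, which is exactly what the training objective penalizes.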
Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting
Kim, Hounsu, Choi, Soonbeom, Nam, Juhan
Synthesizing performing guitar sound is a highly challenging task due to the polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with a customized input representation to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we used not only an existing guitar dataset but also collected data from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model has higher audio quality than the baseline model and generates more realistic timbres than the previous leading work.
- North America > United States (0.04)
- North America > Puerto Rico > San Juan > San Juan (0.04)
- Europe > Italy (0.04)
- (2 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
JAMMIN-GPT: Text-based Improvisation using LLMs in Ableton Live
Hollowell, Sven, Namgyal, Tashi, Marshall, Paul
We introduce a system that allows users of Ableton Live to create MIDI-clips by naming them with musical descriptions. Users can compose by typing the desired musical content directly in Ableton's clip view, which is then inserted by our integrated system. This allows users to stay in the flow of their creative process while quickly generating musical ideas. The system works by prompting ChatGPT to reply using one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature. This is an important step in integrating generative AI tools into pre-existing musical workflows, and could be valuable for content makers who prefer to express their creative vision through descriptive language. Code is available at https://github.com/supersational/JAMMIN-GPT.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.77)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
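The JAMMIN-GPT workflow, turning a clip name into a prompt and pulling a text-based musical format out of the chat reply, can be sketched as below. The prompt wording, the `X:`-anchored extraction, and the mocked reply are all hypothetical illustrations; the actual system's prompts and parsing live in the linked repository.

```python
def build_prompt(clip_name):
    """Turn an Ableton clip name into an ABC-notation request
    (hypothetical wording, not the project's actual prompt)."""
    return (
        "You are a music assistant. Reply ONLY with a complete tune in "
        f"ABC notation matching this description: {clip_name!r}. "
        "Start the reply with 'X:1'."
    )

def extract_abc(reply):
    """Pull the ABC tune body out of a chat reply, tolerating any
    surrounding prose (ABC tunes start with an 'X:' header field)."""
    start = reply.find("X:")
    if start == -1:
        raise ValueError("no ABC tune found in reply")
    return reply[start:].strip()

# Mocked model response for illustration; not real ChatGPT output.
mock_reply = "Sure! Here you go:\nX:1\nT:Jazzy intro\nK:C\nCDEF GABc|"
abc = extract_abc(mock_reply)  # starts with "X:1"
```

The extracted ABC text would then be converted to MIDI and written into the named clip, keeping the user inside Ableton's clip view for the whole round trip.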