midi
Audeo: Audio Generation for a Silent Performance Video
In the last step, we implement MIDI synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly, with only a few setup constraints. We evaluate Audeo on piano performance videos collected from YouTube and find that the generated music is of reasonable audio quality and can be successfully recognized with high precision by popular music identification software. The source code with examples is available in a GitHub repository.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Ireland (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling
Chu, Hung-Ying, Wei, Shao-Yu, Chen, Guan-Wei, Hung, Tzu-Wei, Tsai, ChengYang, Lin, Yu-Cheng
Recent advances in large language models (LLMs) have created new opportunities for symbolic music generation. However, existing formats such as MIDI, ABC, and MusicXML are either overly complex or structurally inconsistent, limiting their suitability for token-based learning architectures. To address these challenges, we propose HNote, a novel hexadecimal-based notation system extended from YNote, which encodes both pitch and duration within a fixed 32-unit measure framework. This design ensures alignment, reduces ambiguity, and is directly compatible with LLM architectures. We converted 12,300 Jiangnan-style songs, generated from traditional folk pieces, from YNote into HNote, and fine-tuned LLaMA-3.1 (8B) using parameter-efficient LoRA. Experimental results show that HNote achieves a syntactic correctness rate of 82.5%, and BLEU and ROUGE evaluations demonstrate strong symbolic and structural similarity, producing stylistically coherent compositions. This study establishes HNote as an effective framework for integrating LLMs with cultural music modeling.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Ireland (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Speech (0.94)
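The fixed-width, measure-aligned encoding idea behind HNote can be illustrated with a minimal sketch. The exact token layout below (two hex digits each for MIDI pitch, onset slot, and duration) is a hypothetical illustration, not the published HNote specification.

```python
def encode_measure(notes, units_per_measure=32):
    """Encode (midi_pitch, onset_unit, duration_units) triples as hex tokens.

    Each note becomes a fixed-width token: 2 hex digits of MIDI pitch,
    2 hex digits of onset (0-31), and 2 hex digits of duration (1-32).
    Fixed width keeps the token stream aligned for an LLM tokenizer.
    """
    tokens = []
    for pitch, onset, dur in notes:
        assert 0 <= onset < units_per_measure
        assert 1 <= dur <= units_per_measure
        tokens.append(f"{pitch:02X}{onset:02X}{dur:02X}")
    return " ".join(tokens)

def decode_measure(text):
    """Invert encode_measure back to (pitch, onset, duration) triples."""
    return [
        (int(tok[0:2], 16), int(tok[2:4], 16), int(tok[4:6], 16))
        for tok in text.split()
    ]

# Middle C (MIDI 60) on beat 1 for 8/32 of the measure, then E (64) on beat 2.
measure = encode_measure([(60, 0, 8), (64, 8, 8)])  # "3C0008 400808"
```

Because every token has the same width and every measure spans the same 32 units, a decoder can always recover note boundaries without a grammar, which is the alignment property the abstract emphasizes.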
PianoVAM: A Multimodal Piano Performance Dataset
Kim, Yonghyun, Park, Junhyung, Bae, Joonhyung, Kim, Kirak, Kwon, Taegyun, Lerch, Alexander, Nam, Juhan
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
- North America > United States (0.14)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.68)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Sulun, Serkan, Viana, Paula, Davies, Matthew E. P.
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nebraska (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Bang, Hayeon, Choi, Eunjin, Finch, Megan, Doh, Seungheon, Lee, Seolhee, Lee, Gyeong-Hoon, Nam, Juhan
While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multi-modal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Wu, Shangda, Wang, Yashan, Yuan, Ruibin, Guo, Zhancheng, Tan, Xu, Zhang, Ge, Zhou, Monan, Chen, Jing, Mu, Xuefeng, Gao, Yuejie, Dong, Yuanliang, Liu, Jiafeng, Li, Xiaobing, Yu, Feng, Sun, Maosong
Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Rhode Island (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (14 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Education > Curriculum > Subject-Specific Education (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
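The contrastive alignment between the text and music encoders can be illustrated with a minimal NumPy sketch of a symmetric CLIP-style InfoNCE objective. The temperature value and the use of plain NumPy are illustrative assumptions; CLaMP 2's actual loss and hyperparameters may differ.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/music embeddings.

    Row i of text_emb and row i of music_emb describe the same piece,
    so matched pairs sit on the diagonal of the similarity matrix.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature          # (B, B) cosine similarities
    diag = np.arange(len(logits))
    loss_t2m = -log_softmax(logits)[diag, diag].mean()   # text -> music
    loss_m2t = -log_softmax(logits.T)[diag, diag].mean() # music -> text
    return (loss_t2m + loss_m2t) / 2
```

Minimizing this loss pulls each text embedding toward its paired music embedding and pushes it away from the other pieces in the batch, which is what lets a single query embedding serve cross-modal retrieval.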
Toward Fully Self-Supervised Multi-Pitch Estimation
Cwitkowitz, Frank, Duan, Zhiyao
Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- North America > United States > New York > Monroe County > Rochester (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
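One of the abstract's self-supervised objectives, equivariance to geometric transformations, can be sketched as a transposition-equivariance penalty: rolling the input spectrogram along its log-frequency axis should roll the predicted salience map by the same number of bins. The loss form and the use of a circular roll are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def equivariance_loss(model, spec, shift_bins):
    """Transposition-equivariance penalty (sketch).

    `model` maps a (freq, time) spectrogram to a (freq, time) salience
    map. If the input is shifted along the frequency axis, the output
    should shift identically; the mean squared deviation measures how
    far the model is from that property.
    """
    out_of_shifted = model(np.roll(spec, shift_bins, axis=0))
    shifted_out = np.roll(model(spec), shift_bins, axis=0)
    return np.mean((out_of_shifted - shifted_out) ** 2)

# The identity "model" is perfectly shift-equivariant:
spec = np.random.default_rng(0).random((84, 10))
equivariance_loss(lambda x: x, spec, 5)  # 0.0
```

A model that applies frequency-dependent weights (e.g. `lambda x: x * np.arange(84)[:, None]`) breaks this symmetry and incurs a positive loss, which is exactly what the training objective penalizes.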
Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting
Kim, Hounsu, Choi, Soonbeom, Nam, Juhan
Synthesizing performing guitar sound is a highly challenging task due to the polyphony and high variability in expression. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with a customized input representation to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we used not only an existing guitar dataset but also collected data from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model has higher audio quality than the baseline model and generates more realistic timbres than the previous leading work.
- North America > United States (0.04)
- North America > Puerto Rico > San Juan > San Juan (0.04)
- Europe > Italy (0.04)
- (2 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
JAMMIN-GPT: Text-based Improvisation using LLMs in Ableton Live
Hollowell, Sven, Namgyal, Tashi, Marshall, Paul
We introduce a system that allows users of Ableton Live to create MIDI-clips by naming them with musical descriptions. Users can compose by typing the desired musical content directly in Ableton's clip view, which is then inserted by our integrated system. This allows users to stay in the flow of their creative process while quickly generating musical ideas. The system works by prompting ChatGPT to reply using one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature. This is an important step in integrating generative AI tools into pre-existing musical workflows, and could be valuable for content makers who prefer to express their creative vision through descriptive language. Code is available at https://github.com/supersational/JAMMIN-GPT.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.77)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
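The JAMMIN-GPT workflow, turning a clip name into a prompt and pulling a text-based musical format out of the chat reply, can be sketched as below. The prompt wording, the `X:`-anchored extraction, and the mocked reply are all hypothetical illustrations; the actual system's prompts and parsing live in the linked repository.

```python
def build_prompt(clip_name):
    """Turn an Ableton clip name into an ABC-notation request
    (hypothetical wording, not the project's actual prompt)."""
    return (
        "You are a music assistant. Reply ONLY with a complete tune in "
        f"ABC notation matching this description: {clip_name!r}. "
        "Start the reply with 'X:1'."
    )

def extract_abc(reply):
    """Pull the ABC tune body out of a chat reply, tolerating any
    surrounding prose (ABC tunes start with an 'X:' header field)."""
    start = reply.find("X:")
    if start == -1:
        raise ValueError("no ABC tune found in reply")
    return reply[start:].strip()

# Mocked model response for illustration; not real ChatGPT output.
mock_reply = "Sure! Here you go:\nX:1\nT:Jazzy intro\nK:C\nCDEF GABc|"
abc = extract_abc(mock_reply)  # starts with "X:1"
```

The extracted ABC text would then be converted to MIDI and written into the named clip, keeping the user inside Ableton's clip view for the whole round trip.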