PianoVAM: A Multimodal Piano Performance Dataset
Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, Juhan Nam
arXiv.org Artificial Intelligence
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
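The paper's semi-automated fingering annotation matches hand landmarks extracted from the top-view video to the keys reported as pressed in the MIDI stream. As a rough illustration of one plausible matching rule (nearest fingertip along the keyboard's horizontal axis), here is a minimal sketch; the function name, coordinate layout, and matching criterion are illustrative assumptions, not the paper's actual algorithm:

```python
# Illustrative sketch only: assign a fingering label to a pressed key
# by picking the fingertip whose x-coordinate (in pixels, along the
# keyboard axis of the top-view frame) is closest to the key's center.
# The real PianoVAM pipeline is more involved (see the paper).

FINGER_LABELS = ["thumb", "index", "middle", "ring", "pinky"]

def assign_fingering(key_x, fingertip_xs):
    """Return the label of the fingertip nearest to a pressed key.

    key_x        -- x-coordinate of the pressed key's center
    fingertip_xs -- x-coordinates of the five fingertips, thumb first
    """
    distances = [abs(fx - key_x) for fx in fingertip_xs]
    best = min(range(len(distances)), key=distances.__getitem__)
    return FINGER_LABELS[best]

# Example: fingertips spread over five adjacent white keys
tips = [100.0, 123.0, 146.0, 169.0, 192.0]
print(assign_fingering(148.0, tips))  # nearest tip is the middle finger
```

In practice such a rule needs per-frame hand-pose estimates synchronized to MIDI note onsets, which is why the paper emphasizes cross-modal alignment before annotation.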
Sep-11-2025