Beat Tracking as Object Detection

Jaehoon Ahn, Moon-Ryul Jung

arXiv.org Artificial Intelligence 

We propose reframing beat tracking as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select the final predictions. This NMS step plays a role similar to that of dynamic Bayesian networks (DBNs) in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.

1. INTRODUCTION

Beat and downbeat tracking is a task in music information retrieval (MIR) in which the positions of beats and downbeats are computationally predicted from music audio.
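To make the post-processing step concrete, the following is a minimal sketch of greedy 1D non-maximum suppression over temporal intervals, the kind of selection the abstract describes. The `(start, end, score)` interval representation, the `iou_1d`/`nms_1d` names, and the IoU threshold of 0.5 are illustrative assumptions, not the paper's exact settings.

```python
def iou_1d(a, b):
    """Intersection-over-union of two 1D intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(detections, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring interval and
    suppress remaining candidates that overlap it too much.

    detections: list of (start, end, score) tuples.
    Returns the kept detections, highest score first.
    """
    # Sort candidates by confidence, descending.
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop candidates whose temporal IoU with the kept interval
        # meets or exceeds the threshold.
        remaining = [d for d in remaining
                     if iou_1d(best[:2], d[:2]) < iou_threshold]
    return kept

# Example: three overlapping candidates around one beat, plus one
# separate candidate; NMS keeps one interval per beat.
dets = [(0.95, 1.05, 0.9), (0.90, 1.10, 0.6),
        (1.45, 1.55, 0.8), (0.97, 1.07, 0.5)]
print(nms_1d(dets))  # → [(0.95, 1.05, 0.9), (1.45, 1.55, 0.8)]
```

Unlike a DBN decoding stage, this procedure needs no tempo transition model; its only tunable parameter is the overlap threshold.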