Dixon, Simon
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
Stoller, Daniel, Ewert, Sebastian, Dixon, Simon
Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependant on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
Note Value Recognition for Piano Transcription Using Markov Random Fields
Nakamura, Eita, Yoshii, Kazuyoshi, Dixon, Simon
This paper presents a statistical method for use in music transcription that can estimate score times of note onsets and offsets from polyphonic MIDI performance signals. Because performed note durations can deviate largely from score-indicated values, previous methods had the problem of not being able to accurately estimate offset score times (or note values) and thus could only output incomplete musical scores. Based on observations that the pitch context and onset score times are influential on the configuration of note values, we construct a context-tree model that provides prior distributions of note values using these features and combine it with a performance model in the framework of Markov random fields. Evaluation results show that our method reduces the average error rate by around 40 percent compared to existing/simple methods. We also confirmed that, in our model, the score model plays a more important role than the performance model, and it automatically captures the voice structure by unsupervised learning.
An End-to-End Neural Network for Polyphonic Piano Music Transcription
Sigtia, Siddharth, Benetos, Emmanouil, Dixon, Simon
We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments. We investigate various neural network architectures for the acoustic models and also investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare performance of the neural network based acoustic models with two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yields the best performance across all evaluation metrics. We also observe improved performance with the application of the music language models. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.
Identifying Cover Songs Using Information-Theoretic Measures of Similarity
Foster, Peter, Dixon, Simon, Klapuri, Anssi
This paper investigates methods for quantifying similarity between audio signals, specifically for the task of of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantised audio features, to continuous-valued approaches. In the discrete case, we propose a method for computing the normalised compression distance, where we account for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks using a data set comprised of 300 Jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the normalised compression distance (NCD) based on string compression and prediction, where we observe that our proposed normalised compression distance with alignment (NCDA) improves average performance over NCD, for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification using the Million Song Dataset.
In Search of the Horowitz Factor
Widmer, Gerhard, Dixon, Simon, Goebl, Werner, Pampalk, Elias, Tobudic, Asmir
The article introduces the reader to a large interdisciplinary research project whose goal is to use AI to gain new insight into a complex artistic phenomenon. We study fundamental principles of expressive music performance by measuring performance aspects in large numbers of recordings by highly skilled musicians (concert pianists) and analyzing the data with state-of-the-art methods from areas such as machine learning, data mining, and data visualization. The article first introduces the general research questions that guide the project and then summarizes some of the most important results achieved to date, with an emphasis on the most recent and still rather speculative work. Our current results show that it is possible for machines to make novel and interesting discoveries even in a domain such as music and that even if we might never find the "Horowitz Factor," AI can give us completely new insights into complex artistic behavior.
In Search of the Horowitz Factor
Widmer, Gerhard, Dixon, Simon, Goebl, Werner, Pampalk, Elias, Tobudic, Asmir
The article introduces the reader to a large interdisciplinary research project whose goal is to use AI to gain new insight into a complex artistic phenomenon. We study fundamental principles of expressive music performance by measuring performance aspects in large numbers of recordings by highly skilled musicians (concert pianists) and analyzing the data with state-of-the-art methods from areas such as machine learning, data mining, and data visualization. The article first introduces the general research questions that guide the project and then summarizes some of the most important results achieved to date, with an emphasis on the most recent and still rather speculative work. A broad view of the discovery process is given, from data acquisition through data visualization to inductive model building and pattern discovery, and it turns out that AI plays an important role in all stages of such an ambitious enterprise. Our current results show that it is possible for machines to make novel and interesting discoveries even in a domain such as music and that even if we might never find the "Horowitz Factor," AI can give us completely new insights into complex artistic behavior.