Goto

Collaborating Authors

 Benetos, Emmanouil


MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges of modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models that provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training. In our exploration, we identified a combination of teacher models that outperforms conventional speech and audio approaches: an acoustic teacher based on a Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). These teachers effectively guide our student model, a BERT-style transformer encoder, to better model music audio. In addition, we introduce an in-batch noise mixture augmentation to enhance representation robustness. Furthermore, we explore a wide range of settings to overcome the instability of acoustic language model pre-training, which allows our paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model generalises well across 14 music understanding tasks and attains state-of-the-art (SOTA) overall scores. The code and models are online: https://github.com/yizhilll/MERT.
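
For illustration only, a minimal sketch (not the authors' code) of the two-teacher, MLM-style objective described above, assuming frame-level RVQ-VAE code indices and CQT frames have been precomputed as targets; `student`, `code_head`, and `cqt_head` are hypothetical names for the encoder and its prediction heads.

    # Hypothetical sketch of MERT-style masked pre-training (PyTorch).
    # Assumes RVQ-VAE code indices and CQT frames are precomputed targets.
    import torch.nn.functional as F

    def masked_teacher_loss(student, frames, rvq_codes, cqt_bins, mask):
        """frames:    (B, T, D) input frame features (masked spans already replaced)
        rvq_codes: (B, T)    integer pseudo labels from the acoustic teacher
        cqt_bins:  (B, T, Q) CQT frames from the musical teacher
        mask:      (B, T)    True where frames were masked out
        """
        hidden = student(frames)                    # (B, T, H) BERT-style encoder
        code_logits = student.code_head(hidden)     # (B, T, num_codes)
        cqt_pred = student.cqt_head(hidden)         # (B, T, Q)
        # Acoustic teacher: cross-entropy on masked frames only (MLM style).
        ce = F.cross_entropy(code_logits[mask], rvq_codes[mask])
        # Musical teacher: reconstruct CQT bins at the masked frames.
        rec = F.mse_loss(cqt_pred[mask], cqt_bins[mask])
        return ce + rec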


MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning

arXiv.org Artificial Intelligence

The deep learning community has witnessed exponentially growing interest in self-supervised learning (SSL). However, how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner remains largely unexplored. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves results comparable to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with less than 2% of the parameters of the latter. The model will be released on Huggingface (see https://huggingface.co/m-a-p/music2vec-v1).


Reliable Local Explanations for Machine Listening

arXiv.org Machine Learning

One way to analyse the behaviour of machine learning models is through local explanations that highlight input features that maximally influence model predictions. Sensitivity analysis, which involves analysing the effect of input perturbations on model predictions, is one of the methods to generate local explanations. Meaningful input perturbations are essential for generating reliable explanations, but there exists limited work on what such perturbations are and how to perform them. This work investigates these questions in the context of machine listening models that analyse audio. Specifically, we use a state-of-the-art deep singing voice detection (SVD) model to analyse whether explanations from SoundLIME (a local explanation method) are sensitive to how the method perturbs model inputs. The results demonstrate that SoundLIME explanations are sensitive to the content in the occluded input regions. We further propose and demonstrate a novel method for quantitatively identifying suitable content type(s) for reliably occluding inputs of machine listening models. The results for the SVD model suggest that the average magnitude of input mel-spectrogram bins is the most suitable content type for temporal explanations.
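
A minimal sketch of the occlusion idea discussed above, assuming a hypothetical `svd_model` callable that maps a mel-spectrogram to a singing-voice probability; temporal segments are occluded with the average magnitude of the input bins, the content type the paper identifies as most suitable for temporal explanations.

    import numpy as np

    def temporal_occlusion_scores(svd_model, mel, n_segments=10):
        """mel: (n_mels, n_frames) input mel-spectrogram."""
        base = svd_model(mel)                  # P(singing voice) on the full input
        fill = mel.mean()                      # average magnitude of the input bins
        bounds = np.linspace(0, mel.shape[1], n_segments + 1, dtype=int)
        scores = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            occluded = mel.copy()
            occluded[:, lo:hi] = fill          # occlude one temporal segment
            scores.append(base - svd_model(occluded))
        return np.array(scores)                # larger drop = more influential segment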


GAN-based Generation and Automatic Selection of Explanations for Neural Networks

arXiv.org Machine Learning

One way to interpret trained deep neural networks (DNNs) is by inspecting characteristics that neurons in the model respond to, such as by iteratively optimising the model input (e.g., an image) to maximally activate specific neurons. However, this requires a careful selection of hyper-parameters to generate interpretable examples for each neuron of interest, and current methods rely on a manual, qualitative evaluation of each setting, which is prohibitively slow. We introduce a new metric that uses Fréchet Inception Distance (FID) to encourage similarity between model activations for real and generated data. This provides an efficient way to evaluate a set of generated examples for each setting of hyper-parameters. We also propose a novel GAN-based method for generating explanations that enables an efficient search through the input space and imposes a strong prior favouring realistic outputs. We apply our approach to a classification model trained to predict whether a music audio recording contains singing voice. Our results suggest that this proposed metric successfully selects hyper-parameters leading to interpretable examples, avoiding the need for manual evaluation. Moreover, we see that examples synthesised to maximise or minimise the predicted probability of singing voice presence exhibit vocal or non-vocal characteristics, respectively, suggesting that our approach is able to generate suitable explanations for understanding concepts learned by a neural network.
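
A sketch of the selection metric described above: the Fréchet distance between the classification model's activations on real data and on examples generated under one hyper-parameter setting, with the lowest-distance setting kept. Activation matrices are assumed to be precomputed from the same layer; the ranking line at the end is illustrative.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(act_real, act_gen):
        """act_real, act_gen: (N, D) activation matrices from the same layer."""
        mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
        cov_r = np.cov(act_real, rowvar=False)
        cov_g = np.cov(act_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):       # discard tiny imaginary parts from
            covmean = covmean.real         # numerical error in the matrix sqrt
        return float(np.sum((mu_r - mu_g) ** 2)
                     + np.trace(cov_r + cov_g - 2.0 * covmean))

    # e.g. best = min(settings, key=lambda s: frechet_distance(real_acts, gen_acts[s]))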


Sound Event Detection in Synthetic Audio: Analysis of the DCASE 2016 Task Results

arXiv.org Machine Learning

As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), the second task focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the 'Event Detection - Office Synthetic' task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio complexity with respect to background noise and polyphony/density, with the added benefit of a very accurate ground truth. This paper presents the task formulation, evaluation metrics, and submitted systems, and provides a statistical analysis of the results achieved with respect to various aspects of the evaluation dataset.
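
For reference, a rough sketch of the segment-based F-score and error rate commonly used for DCASE sound event detection evaluation (the exact definitions used for the task are given in the paper); inputs are per-segment binary event-activity matrices.

    import numpy as np

    def segment_based_metrics(ref, est):
        """ref, est: (n_segments, n_classes) binary event-activity matrices."""
        ref, est = ref.astype(bool), est.astype(bool)
        tp = (ref & est).sum()
        fp = (~ref & est).sum()
        fn = (ref & ~est).sum()
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        f_score = 2 * precision * recall / max(precision + recall, 1e-12)
        # Error rate: substitutions, deletions and insertions counted per segment,
        # normalised by the number of active reference entries.
        fn_seg = (ref & ~est).sum(axis=1)
        fp_seg = (~ref & est).sum(axis=1)
        subs = np.minimum(fn_seg, fp_seg).sum()
        dels = np.maximum(fn_seg - fp_seg, 0).sum()
        ins = np.maximum(fp_seg - fn_seg, 0).sum()
        error_rate = (subs + dels + ins) / max(ref.sum(), 1)
        return f_score, error_rate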


An End-to-End Neural Network for Polyphonic Piano Music Transcription

arXiv.org Machine Learning

We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments: we investigate various neural network architectures for the acoustic model, and we investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare the performance of the neural-network-based acoustic models with that of two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yield the best performance across all evaluation metrics. We also observe improved performance with the application of the music language model. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.
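
A heavily simplified, hypothetical decoding sketch in the spirit of the model described above: per-frame pitch probabilities from an acoustic model are combined with a language-model transition score inside a beam search. Here `lm_log_prob` is a stand-in scorer rather than the paper's RNN language model, and the candidate-generation heuristic (flipping the least confident pitch decisions) is illustrative only.

    import itertools
    import numpy as np

    def beam_search(acoustic_probs, lm_log_prob, beam_width=8, k_flip=3):
        """acoustic_probs: (T, 88) per-frame pitch probabilities.
        lm_log_prob(prev_frame, frame) -> float transition score.
        Returns the highest-scoring (T, 88) binary piano roll found."""
        T, P = acoustic_probs.shape
        beams = [(0.0, [])]                      # (log score, list of frames)
        for t in range(T):
            probs = acoustic_probs[t]
            base = (probs >= 0.5).astype(int)    # most likely binarisation
            uncertain = np.argsort(np.abs(probs - 0.5))[:k_flip]
            candidates = []
            for flips in itertools.chain.from_iterable(
                    itertools.combinations(uncertain, r) for r in range(k_flip + 1)):
                frame = base.copy()
                frame[list(flips)] ^= 1          # flip a few uncertain pitches
                acoustic = np.sum(np.log(np.where(frame, probs, 1 - probs) + 1e-12))
                candidates.append((frame, acoustic))
            new_beams = []
            for score, history in beams:
                prev = history[-1] if history else np.zeros(P, dtype=int)
                for frame, acoustic in candidates:
                    new_beams.append((score + acoustic + lm_log_prob(prev, frame),
                                      history + [frame]))
            beams = sorted(new_beams, key=lambda b: b[0], reverse=True)[:beam_width]
        return np.array(beams[0][1])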


An evaluation framework for event detection using a morphological model of acoustic scenes

arXiv.org Machine Learning

This paper introduces a model of environmental acoustic scenes which adopts a morphological approach by abstracting the temporal structures of acoustic scenes. To demonstrate its potential, this model is employed to evaluate the performance of a large set of acoustic event detection systems. The model allows us to explicitly control key morphological aspects of the acoustic scene and isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behaviour of evaluated systems, providing guidance for further improvements. The proposed model is validated using systems submitted to the IEEE DCASE Challenge; results indicate that the proposed scheme is able to successfully build datasets useful for evaluating some aspects of the performance of event detection systems, in particular their robustness to new listening conditions and to increasing levels of background sound.
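
A hypothetical sketch of the kind of controlled scene construction such a morphological model supports: isolated event recordings are placed over a background at a chosen event-to-background ratio, so that the background level and event density can be varied independently when evaluating a detector. Function and parameter names are illustrative, not taken from the paper.

    import numpy as np

    def synthesise_scene(background, events, onsets, ebr_db=0.0):
        """background: 1-D array; events: list of 1-D arrays; onsets: sample indices.
        ebr_db sets the event-to-background energy ratio in decibels."""
        scene = background.astype(float)         # work on a float copy
        bg_rms = np.sqrt(np.mean(scene ** 2) + 1e-12)
        for event, onset in zip(events, onsets):
            ev_rms = np.sqrt(np.mean(event ** 2) + 1e-12)
            gain = bg_rms / ev_rms * 10 ** (ebr_db / 20.0)   # set the energy ratio
            end = min(onset + len(event), len(scene))
            scene[onset:end] += gain * event[: end - onset]
        return scene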