Sound Scene Synthesis at the DCASE 2024 Challenge
Lagrange, Mathieu, Lee, Junwon, Tailleur, Modan, Heller, Laurie M., Choi, Keunwoo, McFee, Brian, Imoto, Keisuke, Okamoto, Yuki
This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which were evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.
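For reference, the FAD reduces to the Fréchet distance between two Gaussians fitted to embedding statistics of the generated and reference audio sets. A minimal sketch, assuming embeddings have already been extracted with some pretrained audio model (not specified here):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_clips, embedding_dim), assumed to come
    from a pretrained audio embedding model (not provided here).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    sigma_a = np.cov(emb_a, rowvar=False)
    sigma_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_a @ sigma_b, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```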
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
Lee, Junwon, Tailleur, Modan, Heller, Laurie M., Choi, Keunwoo, Lagrange, Mathieu, McFee, Brian, Imoto, Keisuke, Okamoto, Yuki
Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, the Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
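The metric-versus-ratings correlation reported above is a standard rank-correlation check; a sketch with placeholder numbers (not the actual challenge results):

```python
from scipy.stats import spearmanr

# Hypothetical per-system scores: lower FAD is better, higher rating is better,
# so a strong *negative* rank correlation is what validates the metric.
fad_scores = [2.1, 3.4, 4.0, 5.2]
mean_ratings = [7.8, 6.5, 6.1, 4.9]

rho, pvalue = spearmanr(fad_scores, mean_ratings)
print(f"Spearman rho={rho:.2f} (p={pvalue:.3f})")
```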
EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal
Tailleur, Modan, Pinquier, Julien, Millot, Laurent, Vogel, Corsin, Lagrange, Mathieu
In this paper, we introduce the Extreme Metal Vocals Dataset, which comprises a collection of recordings of extreme vocal techniques performed within the realm of heavy metal music. The dataset consists of 760 audio excerpts between 1 and 30 seconds long, totaling about 100 minutes of audio material, roughly composed of 60 minutes of distorted voices and 40 minutes of clear voice recordings. These vocal recordings come from 27 different singers and are provided without accompanying musical instruments or post-processing effects. The distortion taxonomy within this dataset encompasses four distinct distortion techniques and three vocal effects, all performed in different pitch ranges. The performance of a state-of-the-art deep learning model is evaluated for two different classification tasks related to vocal techniques, demonstrating the potential of this resource for the audio processing community.
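As a rough illustration of the kind of classification pipeline such a dataset supports (this is not the state-of-the-art model evaluated in the paper; the file name and label set below are hypothetical):

```python
import torch
import torch.nn as nn
import torchaudio

# Hypothetical label set for the distortion-technique task.
TECHNIQUES = ["clear", "distortion_a", "distortion_b", "distortion_c", "distortion_d"]

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=1024, n_mels=64)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, len(TECHNIQUES)),
)

waveform, sr = torchaudio.load("excerpt_0001.wav")  # hypothetical mono excerpt
logits = classifier(melspec(waveform).unsqueeze(0))  # shape (1, n_classes)
```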
Fitting Auditory Filterbanks with Multiresolution Neural Networks
Lostanlen, Vincent, Haider, Daniel, Han, Han, Lagrange, Mathieu, Balazs, Peter, Ehler, Martin
Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank "teacher" is engineered by domain knowledge while the neural network "student" is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.
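The core idea, a separate learnable convolution per DWT octave subband, might be sketched as follows; this is a schematic reading of the abstract, not the released MuReNN implementation (pywt for the DWT, PyTorch for the learnable convolutions):

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

class SubbandConvs(nn.Module):
    """One learnable Conv1d per DWT octave subband (schematic MuReNN-like idea).

    Since each DWT level is downsampled by 2, a fixed kernel size covers an
    exponentially larger time span per octave, i.e. receptive fields dilate
    with scale as described in the abstract.
    """
    def __init__(self, n_levels: int = 8, kernel_size: int = 17):
        super().__init__()
        # n_levels detail subbands plus one approximation subband.
        self.convs = nn.ModuleList(
            nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
            for _ in range(n_levels + 1)
        )
        self.n_levels = n_levels

    def forward(self, x: np.ndarray) -> list[torch.Tensor]:
        # coeffs[0] is the approximation, then detail subbands coarse-to-fine.
        coeffs = pywt.wavedec(x, "db4", level=self.n_levels)
        return [
            conv(torch.as_tensor(c, dtype=torch.float32).view(1, 1, -1))
            for conv, c in zip(self.convs, coeffs)
        ]

bank = SubbandConvs()
outs = bank(np.random.randn(2 ** 12))  # one filtered tensor per subband
```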
Perceptual-Neural-Physical Sound Matching
Han, Han, Lostanlen, Vincent, Lagrange, Mathieu
Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of the loss function. On one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and serves in differentiable digital signal processing (DDSP). Yet, spectral loss is a poor predictor of pitch intervals, and its gradient may be computationally expensive, hence slow convergence. Against this conundrum, we present the Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as the decoder and the joint time-frequency scattering transform (JTFS) as the spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.
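A sketch of the PNP construction, under the standard Gauss-Newton reading of "optimal quadratic approximation": the metric kernel is J^T J, where J is the Jacobian of the spectral representation composed with the synthesizer, evaluated at the target parameters. The toy synthesizer and STFT-based Phi below are stand-ins for the paper's physical model and JTFS:

```python
import torch
from torch.autograd.functional import jacobian

def synth(theta: torch.Tensor) -> torch.Tensor:
    """Stand-in differentiable synthesizer g: parameters -> waveform.
    (A toy decaying sinusoid; the paper uses physical modeling synthesis.)"""
    t = torch.linspace(0.0, 1.0, 4096)
    freq, decay = theta
    return torch.exp(-decay * t) * torch.sin(2 * torch.pi * freq * t)

def spectral_repr(x: torch.Tensor) -> torch.Tensor:
    """Stand-in spectral map Phi (the paper uses JTFS); here a magnitude STFT."""
    spec = torch.stft(x, n_fft=512, window=torch.hann_window(512),
                      return_complex=True)
    return spec.abs().flatten()

theta_target = torch.tensor([440.0, 3.0])
# Metric kernel M = J^T J, with J the Jacobian of Phi∘g at the target.
J = jacobian(lambda th: spectral_repr(synth(th)), theta_target)
M = J.T @ J

def pnp_loss(theta: torch.Tensor) -> torch.Tensor:
    d = theta - theta_target
    return d @ M @ d  # quadratic form: fast like P-loss, shaped like spectral loss
```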
Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis
Vahidi, Cyrus, Han, Han, Wang, Changhong, Lagrange, Mathieu, Fazekas, György, Lostanlen, Vincent
Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time-frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time-frequency model of similarity at the level of both local spectra and spectrotemporal modulations.
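The alignment sensitivity mentioned in the last sentence is easy to reproduce numerically; the sketch below compares a frame-wise spectrogram distance with a crude time-averaged variant (a stand-in for the invariance that scattering provides) under a small time shift:

```python
import numpy as np
from scipy.signal import stft

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 4 * t) > 0)  # gated tone
y = np.roll(x, 128)  # the same signal shifted by ~8 ms

def spec(sig):
    return np.abs(stft(sig, fs=sr, nperseg=512)[2])

# Frame-wise spectrogram distance: large for a mere time shift.
d_spec = np.linalg.norm(spec(x) - spec(y))
# Time-averaged modulus (a crude time-invariant summary): nearly unchanged.
d_avg = np.linalg.norm(spec(x).mean(axis=1) - spec(y).mean(axis=1))
print(d_spec, d_avg)  # d_spec is much larger than d_avg
```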
Sound Event Detection in Synthetic Audio: Analysis of the DCASE 2016 Task Results
Lafay, Grégoire, Benetos, Emmanouil, Lagrange, Mathieu
As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), the second task focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the 'Event Detection - Office Synthetic' task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio complexity with respect to background noise and polyphony/density, with the added benefit of a very accurate ground truth. This paper presents the task formulation, evaluation metrics, submitted systems, and provides a statistical analysis of the results achieved, with respect to various aspects of the evaluation dataset.
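Segment-based detection metrics of the kind used in such evaluations reduce, per segment and class, to counting agreements between binarized reference and estimated activity grids; a minimal sketch (not the official sed_eval implementation):

```python
import numpy as np

def segment_f1(reference: np.ndarray, estimate: np.ndarray) -> float:
    """F-score over (n_segments, n_classes) boolean activity matrices,
    e.g. one-second segments as in DCASE-style segment-based evaluation."""
    tp = np.sum(reference & estimate)
    fp = np.sum(~reference & estimate)
    fn = np.sum(reference & ~estimate)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

ref = np.array([[1, 0], [1, 1], [0, 0]], dtype=bool)
est = np.array([[1, 0], [0, 1], [1, 0]], dtype=bool)
print(segment_f1(ref, est))  # 4 / 6 ≈ 0.667
```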
The bag-of-frames approach: a not so sufficient model for urban soundscapes
Lagrange, Mathieu, Lafay, Grégoire, Defreville, Boris, Aucouturier, Jean-Julien
The "bag-of-frames" approach (BOF), which encodes audio signals as the long-term statistical distribution of short-term spectral features, is commonly regarded as an effective and sufficient way to represent environmental sound recordings (soundscapes) since its introduction in an influential 2007 article. The present paper describes a concep-tual replication of this seminal article using several new soundscape datasets, with results strongly questioning the adequacy of the BOF approach for the task. We show that the good accuracy originally re-ported with BOF likely result from a particularly thankful dataset with low within-class variability, and that for more realistic datasets, BOF in fact does not perform significantly better than a mere one-point av-erage of the signal's features. Soundscape modeling, therefore, may not be the closed case it was once thought to be. Progress, we ar-gue, could lie in reconsidering the problem of considering individual acoustical events within each soundscape.
An evaluation framework for event detection using a morphological model of acoustic scenes
Lagrange, Mathieu, Lafay, Grégoire, Rossignol, Mathias, Benetos, Emmanouil, Roebel, Axel
This paper introduces a model of environmental acoustic scenes which adopts a morphological approach by abstracting the temporal structures of acoustic scenes. To demonstrate its potential, this model is employed to evaluate the performance of a large set of acoustic event detection systems. The model allows us to explicitly control key morphological aspects of the acoustic scene and isolate their impact on the performance of the system under evaluation. Thus, more information can be gained about the behavior of evaluated systems, providing guidance for further improvements. The proposed model is validated using submitted systems from the IEEE DCASE Challenge; results indicate that the proposed scheme is able to successfully build datasets useful for evaluating some aspects of the performance of event detection systems, in particular their robustness to new listening conditions and increasing levels of background sounds.
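The generative step such a morphological model controls, placing isolated events over a background at prescribed onsets and signal-to-noise ratios, might be sketched as follows; this is a schematic illustration, not the authors' simulator:

```python
import numpy as np

def mix_scene(background: np.ndarray, events: list[np.ndarray],
              onsets: list[int], snr_db: float) -> np.ndarray:
    """Overlay events on a background at given sample onsets and a global
    event-to-background SNR; onsets and snr_db are the morphological knobs."""
    scene = background.copy()
    bg_rms = np.sqrt(np.mean(background ** 2))
    for event, onset in zip(events, onsets):
        ev_rms = np.sqrt(np.mean(event ** 2))
        gain = bg_rms / ev_rms * 10 ** (snr_db / 20)  # scale event to target SNR
        scene[onset:onset + len(event)] += gain * event[: len(scene) - onset]
    return scene

rng = np.random.default_rng(1)
bg = 0.01 * rng.standard_normal(10 * 44100)          # 10 s of background noise
ev = np.sin(2 * np.pi * 880 * np.arange(4410) / 44100)  # a 100 ms tonal event
scene = mix_scene(bg, [ev, ev], onsets=[44100, 220500], snr_db=6.0)
```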