AITopics | Richard, Gaël

Collaborating Authors

Richard, Gaël

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Sadok, Samir, Leglaive, Simon, Girin, Laurent, Richard, Gaël, Alameda-Pineda, Xavier

arXiv.org Artificial IntelligenceJan-9-2025

Abstract--This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysisresynthesis, pitch estimation, pitch modification, and speech enhancement. Over the years, many speech processing algorithms have been In this paper, we introduce AnCoGen for analyzing, controlling, developed to analyze, transform, and synthesize speech signals.

ancogen, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2501.05332

Country: Europe > France (0.29)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Multiple Choice Learning for Efficient Speech Separation with Many Speakers

Perera, David, Derrida, François, Mariotte, Théo, Richard, Gaël, Essid, Slim

arXiv.org Machine LearningNov-27-2024

Training speech separation models in the supervised setting raises a permutation problem: finding the best assignation between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performances of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Machine Learning

2411.18497

Country: Europe > France (0.14)

Genre: Research Report (0.82)

Industry: Education (0.62)

Technology:

Information Technology > Artificial Intelligence > Speech (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification

Zhuang, Xuanyu, Peeters, Geoffroy, Richard, Gaël

arXiv.org Artificial IntelligenceOct-4-2024

The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of the test episode of a C-way-K-shot test episode (without using the query set which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with such a fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to fast adaption to only a few samples, we utilize ProtoNet as the target model to enhance its fine-tuning performance with the help of a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. We note that although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.

artificial intelligence, machine learning, protonet, (16 more...)

arXiv.org Artificial Intelligence

2410.05302

Genre:

Research Report > Promising Solution (0.34)
Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Structure-informed Positional Encoding for Music Generation

Agarwal, Manvi, Wang, Changhong, Richard, Gaël

arXiv.org Artificial IntelligenceFeb-28-2024

Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. As a comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically-motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2402.13301

Country: Europe > France (0.14)

Genre: Research Report (0.50)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport

Torres, Bernardo, Peeters, Geoffroy, Richard, Gaël

arXiv.org Artificial IntelligenceJan-15-2024

In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.

artificial intelligence, frequency, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2312.14507

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.84)

Add feedback

Singer Identity Representation Learning using Self-Supervised Techniques

Torres, Bernardo, Lattner, Stefan, Richard, Gaël

arXiv.org Artificial IntelligenceJan-10-2024

Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.5281/zenodo.10265323

2401.05064

Country:

North America > United States (0.14)
Europe > Italy (0.14)
Europe > France (0.14)
Asia > Japan (0.14)

Genre: Research Report > New Finding (0.93)

Industry:

Media > Music (0.69)
Leisure & Entertainment (0.69)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.54)
(3 more...)

Add feedback

Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis

Letzelter, Victor, Fontaine, Mathieu, Chen, Mickaël, Pérez, Patrick, Essid, Slim, Richard, Gaël

arXiv.org Machine LearningNov-16-2023

We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.

artificial intelligence, hypothesis, machine learning, (16 more...)

arXiv.org Machine Learning

2311.01052

Country: Europe > France (0.14)

Genre: Research Report (1.00)

Industry: Education (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.66)

Add feedback

Video-to-Music Recommendation using Temporal Alignment of Segments

Prétet, Laure, Richard, Gaël, Souchier, Clément, Peeters, Geoffroy

arXiv.org Artificial IntelligenceJun-12-2023

We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to significantly improve the system's performance using structure-aware recommendation. The core idea is to consider not only the full audio-video clips, but rather shorter segments for training and inference. We find that using semantic segments and ranking the tracks according to sequence alignment costs significantly improves the results. We investigate the impact of different ranking metrics and segmentation methods.

artificial intelligence, machine learning, recommendation, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TMM.2022.3152598

2306.07187

Country:

Europe (1.00)
North America > United States > New York (0.15)
North America > United States > Massachusetts (0.14)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.34)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization

Parekh, Jayneel, Parekh, Sanjeel, Mozharovskyi, Pavlo, Richard, Gaël, d'Alché-Buc, Florence

arXiv.org Artificial IntelligenceMay-11-2023

This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable model with high performance. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, an interpreter is trained to generate a regularized intermediate embedding from hidden layers of a target network, learnt as time-activations of a pre-learnt NMF dictionary. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on a variety of classification tasks, including multi-label data for real-world audio and music.

artificial intelligence, interpretation, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2305.07132

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms

Sefidgaran, Milad, Gohari, Amin, Richard, Gaël, Şimşekli, Umut

arXiv.org Machine LearningJun-29-2022

Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2203.02474

Country: Europe (0.67)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback