
 Tahon, Marie


Predefined Prototypes for Intra-Class Separation and Disentanglement

arXiv.org Artificial Intelligence

Prototypical Learning is based on the idea that there is a point (which we call prototype) around which the embeddings of a class are clustered. It has shown promising results in scenarios with little labeled data or to design explainable models. Typically, prototypes are either defined as the average of the embeddings of a class or are designed to be trainable. In this work, we propose to predefine prototypes following human-specified criteria, which simplifies the training pipeline and brings different advantages. It is possible to associate some concrete dimensions of these representations with concrete human-understandable features, so that a change of a feature produces changes in only a few dimensions of the space. This has some advantages such as (i) having more control over data creation in generative models [8], or (ii) providing the ability to explain and interpret model predictions [9]. In this paper we propose a modification of prototypical systems that preserves their default advantages and, in addition, allows solving the two problems presented.
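
A minimal sketch of the general idea, assuming a distance-based classifier over fixed, human-specified prototype vectors (the layout and dimensions below are illustrative, not the paper's configuration):

```python
# Classify embeddings by their distance to predefined, non-trainable prototypes
# instead of learned class means or trainable prototype parameters.
import torch
import torch.nn.functional as F

def prototype_logits(embeddings: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Return logits as negative squared Euclidean distances to each prototype.

    embeddings: (batch, dim) encoder outputs
    prototypes: (num_classes, dim) predefined, fixed prototype vectors
    """
    dists = torch.cdist(embeddings, prototypes, p=2) ** 2  # (batch, num_classes)
    return -dists  # closer prototype -> larger logit

# Example: 3 classes in a 4-D space, prototypes placed by hand so that each class
# occupies one axis (a human-specified, interpretable layout).
prototypes = torch.tensor([[1., 0., 0., 0.],
                           [0., 1., 0., 0.],
                           [0., 0., 1., 0.]])
embeddings = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
loss = F.cross_entropy(prototype_logits(embeddings, prototypes), labels)
```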


A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification

arXiv.org Artificial Intelligence

This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at the French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have been found so far). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal, and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. An evaluation of the quality of the automatic processing and of the final output is provided. It shows that the automatic processing is comparable to state-of-the-art pipelines, and that the output provides high-quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.
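
A structural sketch of this kind of pipeline is given below; the helper names (detect_speech, remove_music_and_overlap, diarize) are hypothetical placeholders standing in for the automatic tools, not the actual INA processing chain:

```python
# Chain the automatic steps so that annotators only confirm which diarized
# speaker is the target one, instead of listening to whole documents.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # diarization label, e.g. "spk_03"

def detect_speech(audio_path: str) -> list[tuple[float, float]]:
    """Hypothetical speech activity detector returning (start, end) regions."""
    raise NotImplementedError

def remove_music_and_overlap(audio_path: str, regions) -> list[tuple[float, float]]:
    """Hypothetical filter dropping regions with background music or overlapped speech."""
    raise NotImplementedError

def diarize(audio_path: str, regions) -> list[Segment]:
    """Hypothetical speaker diarization returning speaker-homogeneous segments."""
    raise NotImplementedError

def candidate_segments(audio_path: str) -> list[Segment]:
    """Automatic part of the pipeline; its output is handed to human annotators."""
    speech = detect_speech(audio_path)
    clean = remove_music_and_overlap(audio_path, speech)
    return diarize(audio_path, clean)
```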


Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder

arXiv.org Artificial Intelligence

Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without having paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal relies exclusively on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of these latent variables is imposed to depend exclusively on the domain, while the other one must depend on the rest of the variability factors of the data. Additionally, the conditions imposed over the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets, improving on the performance of other well-known methods. Finally, we show that, indeed, one of the latent variables stores all the information related to the domain while the other one hardly contains any domain information.
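
A minimal sketch of the general architecture, assuming a VAE whose latent code is split into a domain part and a content part, with an auxiliary classifier forcing the domain part to carry the domain label (dimensions and losses are illustrative, not the paper's exact model):

```python
import torch
import torch.nn as nn

class TwoLatentVAE(nn.Module):
    def __init__(self, x_dim=784, z_dom=2, z_cnt=16, n_domains=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dom + z_cnt)
        self.logvar = nn.Linear(256, z_dom + z_cnt)
        self.dec = nn.Sequential(nn.Linear(z_dom + z_cnt, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))
        self.dom_clf = nn.Linear(z_dom, n_domains)  # pushes z_dom to encode the domain
        self.z_dom = z_dom

    def forward(self, x, domain):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        x_hat = self.dec(z)
        rec = nn.functional.mse_loss(x_hat, x)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        dom = nn.functional.cross_entropy(self.dom_clf(z[:, :self.z_dom]), domain)
        # Translation at test time = swap the domain slice of z before decoding.
        return rec + kl + dom

loss = TwoLatentVAE()(torch.rand(4, 784), torch.randint(0, 3, (4,)))
```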


An Explainable Proxy Model for Multilabel Audio Segmentation

arXiv.org Artificial Intelligence

Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is a vital process for transparency of decision-making with machine learning. In this paper, we propose an explainable multilabel segmentation model that solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD) simultaneously. This proxy uses non-negative matrix factorization (NMF) to map the embedding used for the segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to the pre-trained black-box model while offering strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
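
A small sketch of the underlying NMF step, assuming scikit-learn's NMF applied to a magnitude spectrogram; the shapes and the saliency readout are illustrative, not the released proxy model:

```python
# Decompose a magnitude spectrogram as X ~ W @ A and read off which frequency
# bins (rows of W) dominate the reconstruction of a given frame.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((257, 400))            # |spectrogram|: 257 frequency bins x 400 frames

nmf = NMF(n_components=16, init="nndsvd", max_iter=400)
A = nmf.fit_transform(X.T)            # (400, 16) per-frame component activations
W = nmf.components_.T                 # (257, 16) spectral templates (freq x component)

frame = 100
saliency = W @ A[frame]               # (257,) contribution of each frequency bin
top_bins = np.argsort(saliency)[-10:] # most influential bins for this frame (local explanation)
```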


Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations

arXiv.org Artificial Intelligence

The goal of our research is to automatically retrieve satisfaction and frustration in real-life call-center conversations. This study focuses on an industrial application in which customer satisfaction is continuously tracked in order to improve customer services. To compensate for the lack of large annotated emotional databases, we explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus. Moreover, several studies have pointed out that emotion can be detected not only in speech but also in facial traits, in biological responses or in textual information. In the context of telephone conversations, we can break down the audio information into acoustic and linguistic streams by using the speech signal and its transcription. Our experiments confirm the large gain in performance obtained with the use of pre-trained features. Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction and best generalizes to unseen data. Our experiments conclude that there is a definitive advantage to using CamemBERT representations; however, the benefit of fusing the acoustic and linguistic modalities is not as obvious. With models learnt on individual annotations, we found that fusion approaches are more robust to the subjectivity of the annotation task. This study also tackles the problem of performance variability and intends to estimate this variability from different views: weight initialization, confidence intervals and annotation subjectivity. A deep analysis of the linguistic content investigates interpretable factors able to explain the high contribution of the linguistic modality to this task.
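
An illustrative late-fusion sketch, assuming pre-extracted utterance-level acoustic and linguistic embeddings (e.g. wav2vec-style and CamemBERT-style vectors); the dimensions and regressor head are assumptions, not the paper's architecture:

```python
# Concatenate the two modality embeddings and regress a continuous satisfaction score.
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    def __init__(self, acoustic_dim=768, linguistic_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Tanh(),               # continuous score bounded in [-1, 1]
        )

    def forward(self, acoustic_emb, linguistic_emb):
        return self.head(torch.cat([acoustic_emb, linguistic_emb], dim=-1))

# Usage with dummy pre-extracted embeddings:
model = LateFusionRegressor()
score = model(torch.randn(4, 768), torch.randn(4, 768))  # (4, 1)
```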


Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

arXiv.org Artificial Intelligence

Voice activity detection and overlapped speech detection (VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. ...). We propose two 2-class VAD and OSD systems and a 3-class VAD+OSD system for mono- and multi-channel signals. We evaluate how beneficial the 3-class approach is in comparison to the use of two independent VAD and OSD models in terms of F1-score and training resources. Each system is trained and evaluated on four different datasets covering various speech domains, including both single- and multiple-microphone scenarios. To the best of our knowledge, no benchmark has been conducted on these approaches across various speech domains and recording setups.
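
A sketch of the 3-class framing, assuming frame-level labels {non-speech, single speaker, overlap} from which both VAD and OSD decisions are derived; the GRU classifier and thresholds below are illustrative, not the benchmarked models:

```python
# One frame-level classifier serves both tasks: VAD = speech or overlap present,
# OSD = overlap present.
import torch
import torch.nn as nn

NON_SPEECH, SPEECH, OVERLAP = 0, 1, 2

class FrameClassifier(nn.Module):
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 3)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                    # (batch, frames, 3) logits

def decisions(logits):
    p = logits.softmax(dim=-1)
    vad = p[..., SPEECH] + p[..., OVERLAP] > 0.5   # at least one active speaker
    osd = p[..., OVERLAP] > 0.5                    # at least two simultaneous speakers
    return vad, osd

vad, osd = decisions(FrameClassifier()(torch.randn(2, 200, 80)))
```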


Evaluation of Speaker Anonymization on Emotional Speech

arXiv.org Artificial Intelligence

Speech data carries a range of personal information, such as the speaker's identity and emotional state. These attributes can be used for malicious purposes. With the development of virtual assistants, a new generation of privacy threats has emerged. Current studies have addressed the topic of preserving speech privacy. One of them, the VoicePrivacy initiative, aims to promote the development of privacy preservation tools for speech technology. The task selected for the VoicePrivacy 2020 Challenge (VPC) is speaker anonymization. The goal is to hide the source speaker's identity while preserving the linguistic information. The baseline of the VPC makes use of a voice conversion system. This paper studies the impact of the VPC speaker anonymization baseline system on the emotional information present in speech utterances. Evaluation is performed following the VPC rules regarding the attackers' knowledge about the anonymization system. Our results show that the VPC baseline system does not suppress speakers' emotions against informed attackers. When comparing anonymized speech to original speech, the emotion recognition performance is degraded by 15% relative on IEMOCAP data, similar to the degradation observed for the automatic speech recognition used to evaluate the preservation of the linguistic information.
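
A small worked example of the relative-degradation figure quoted above (the scores below are illustrative, not the paper's results):

```python
# Relative degradation of an emotion-recognition score when the same system is
# evaluated on anonymized instead of original speech.
def relative_degradation(score_original: float, score_anonymized: float) -> float:
    return 100.0 * (score_original - score_anonymized) / score_original

# e.g. an accuracy dropping from 0.60 to 0.51 is a 15% relative loss
print(relative_degradation(0.60, 0.51))  # 15.0
```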