Alameda-Pineda, Xavier
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
Sadok, Samir, Leglaive, Simon, Girin, Laurent, Richard, Gaël, Alameda-Pineda, Xavier
This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allows precise control of the synthesized speech by modifying them. Extensive experiments demonstrate the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
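To make the mask-selects-the-task idea concrete, here is a minimal sketch assuming an illustrative token layout of six attribute slots followed by the spectrogram-frame slots; the names, sizes, and the `task_mask` helper are ours, not the released model.

```python
# Hypothetical mask construction for a single masked autoencoder that handles
# analysis, generation, and control: the mask decides what gets predicted.
import torch

N_ATTR, N_SPEC = 6, 100           # attribute slots + spectrogram-frame slots (assumed)

def task_mask(mode):
    """Boolean mask over the joint [attributes | speech] sequence (True = predict)."""
    mask = torch.zeros(N_ATTR + N_SPEC, dtype=torch.bool)
    if mode == "analysis":                   # observe speech, predict the attributes
        mask[:N_ATTR] = True
    elif mode in ("generation", "control"):  # predict speech from (possibly edited) attributes
        mask[N_ATTR:] = True
    return mask

mask = task_mask("analysis")      # would be fed to the masked autoencoder with the tokens
```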
Diffusion-based Unsupervised Audio-visual Speech Enhancement
Ayilo, Jean-Eudes, Sadeghi, Mostafa, Serizel, Romain, Alameda-Pineda, Xavier
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech, conditioned on the corresponding video data, to model the speech generative distribution. This pre-trained model is then paired with the NMF-based noise model to iteratively estimate clean speech. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process, where after each iteration a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method. Additionally, the new inference algorithm offers a better trade-off between inference speed and performance than the previous diffusion-based method.
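The alternation between reverse diffusion and noise-parameter updates can be pictured with a toy loop. The Itakura-Saito NMF multiplicative updates below are standard; the speech-estimate update is a crude placeholder for the pre-trained audio-visual diffusion sampler, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_nmf_update(V, W, H, eps=1e-8):
    """One multiplicative Itakura-Saito update of the NMF noise model V ~= W @ H."""
    WH = W @ H + eps
    H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
    WH = W @ H + eps
    W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)
    return W, H

Y = np.abs(rng.normal(size=(64, 100))) ** 2   # noisy power spectrogram (toy)
W, H = rng.random((64, 8)) + 0.1, rng.random((8, 100)) + 0.1
S = Y.copy()                                  # running clean-speech estimate
for t in range(50):
    # placeholder for one reverse-diffusion posterior-sampling step
    S = 0.9 * S + 0.1 * np.maximum(Y - W @ H, 1e-8)
    # update the noise parameters from the current residual
    W, H = is_nmf_update(np.maximum(Y - S, 1e-8), W, H)
```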
Socially Pertinent Robots in Gerontological Healthcare
Alameda-Pineda, Xavier, Addlesee, Angus, García, Daniel Hernández, Reinke, Chris, Arias, Soraya, Arrigoni, Federica, Auternaud, Alex, Blavette, Lauriane, Beyan, Cigdem, Camara, Luis Gomez, Cohen, Ohad, Conti, Alessandro, Dacunha, Sébastien, Dondrup, Christian, Ellinson, Yoav, Ferro, Francesco, Gannot, Sharon, Gras, Florian, Gunson, Nancie, Horaud, Radu, D'Incà, Moreno, Kimouche, Imad, Lemaignan, Séverin, Lemon, Oliver, Liotard, Cyril, Marchionni, Luca, Moradi, Mordehay, Pajdla, Tomas, Pino, Maribel, Polic, Michal, Py, Matthieu, Rado, Ariel, Ren, Bin, Ricci, Elisa, Rigaud, Anne-Sophie, Rota, Paolo, Romeo, Marta, Sebe, Nicu, Sieińska, Weronika, Tandeitnik, Pinchas, Tonini, Francesco, Turro, Nicolas, Wintz, Timothée, Yu, Yanchao
Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.
Univariate Radial Basis Function Layers: Brain-inspired Deep Neural Layers for Low-Dimensional Inputs
Jost, Daniel, Patil, Basavasagar, Alameda-Pineda, Xavier, Reinke, Chris
Deep Neural Networks (DNNs) have become the standard tool for function approximation, with most architectures developed for high-dimensional input data. However, many real-world problems have low-dimensional inputs, for which standard Multi-Layer Perceptrons (MLPs) are the default choice; an investigation into specialized architectures is missing. We propose a novel DNN layer, the Univariate Radial Basis Function (U-RBF) layer, as an alternative. Similar to sensory neurons in the brain, the U-RBF layer processes each individual input dimension with a population of neurons whose activations depend on different preferred input values. We verify its effectiveness against MLPs on low-dimensional function regression and reinforcement learning tasks. The results show that the U-RBF layer is especially advantageous when the target function becomes complex and difficult to approximate.
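A minimal sketch of such a layer is given below; the class name, initialization, and per-dimension learnable widths are our assumptions, not necessarily the paper's exact design.

```python
import math
import torch
import torch.nn as nn

class URBFLayer(nn.Module):
    """One population of Gaussian units per scalar input dimension."""
    def __init__(self, in_dim, n_units=16, sigma=0.5):
        super().__init__()
        # learnable "preferred values" (centers), one row per input dimension
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, n_units).repeat(in_dim, 1))
        self.log_sigma = nn.Parameter(torch.full((in_dim, n_units), math.log(sigma)))

    def forward(self, x):                      # x: (batch, in_dim)
        d = x.unsqueeze(-1) - self.centers     # (batch, in_dim, n_units)
        act = torch.exp(-0.5 * (d / self.log_sigma.exp()) ** 2)
        return act.flatten(1)                  # (batch, in_dim * n_units)

feats = URBFLayer(in_dim=2)(torch.randn(4, 2))   # -> (4, 32), fed to an MLP head
```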
Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation
Lin, Xiaoyu, Girin, Laurent, Alameda-Pineda, Xavier
In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete assignment variable and the continuous DVAE variables representing the sources' content/position are estimated using a variational expectation-maximization algorithm, leading to multi-source trajectory estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method performs well on both tasks and outperforms several baseline methods.
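In a simplified Gaussian stand-in for the DVAE source predictions, the E-step on the assignment variable reduces to computing soft responsibilities, as in this sketch (function and variable names are ours):

```python
import numpy as np

def assignment_posterior(obs, src_means, src_var=0.1, prior=None):
    """obs: (N, D) observations; src_means: (K, D) per-source predictions.
    Returns the (N, K) posterior over the observation-to-source assignment."""
    K = src_means.shape[0]
    prior = np.full(K, 1.0 / K) if prior is None else prior
    # log N(obs | mu_k, src_var * I) up to a constant shared across sources
    logp = -0.5 * ((obs[:, None, :] - src_means[None]) ** 2).sum(-1) / src_var
    logp += np.log(prior)
    logp -= logp.max(axis=1, keepdims=True)        # numerical stability
    resp = np.exp(logp)
    return resp / resp.sum(axis=1, keepdims=True)  # rows sum to 1
```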
Motion-DVAE: Unsupervised learning for fast human motion denoising
Fiche, Guénolé, Leglaive, Simon, Alameda-Pineda, Xavier, Séguier, Renaud
Pose and motion priors are crucial for recovering realistic and accurate human motion from noisy observations. Substantial progress has been made on pose and shape estimation from images, and recent works have shown impressive results using priors to refine frame-wise predictions. However, many motion priors only model transitions between consecutive poses and are used in time-consuming optimization procedures, which is problematic for applications requiring real-time motion capture. We introduce Motion-DVAE, a motion prior that captures the short-term dependencies of human motion. As a member of the dynamical variational autoencoder (DVAE) model family, Motion-DVAE combines the generative capability of VAE models with the temporal modeling of recurrent architectures. Together with Motion-DVAE, we introduce an unsupervised, learned denoising method that unifies regression- and optimization-based approaches in a single framework for real-time 3D human pose estimation. Experiments show that the proposed approach reaches performance competitive with state-of-the-art methods while being much faster.
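A hedged sketch of a recurrent motion denoiser in this spirit: the DVAE's stochastic latent variables are replaced by a deterministic bottleneck for brevity, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, pose_dim=72, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent)   # deterministic stand-in for q(z|x)
        self.dec = nn.GRU(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, noisy):                        # noisy: (batch, T, pose_dim)
        h, _ = self.enc(noisy)
        d, _ = self.dec(self.to_latent(h))
        return self.out(d)                           # refined poses, same shape

refined = MotionDenoiser()(torch.randn(2, 30, 72))   # 30 frames of SMPL-like poses
```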
A weighted-variance variational autoencoder model for speech enhancement
Golmakani, Ali, Sadeghi, Mostafa, Alameda-Pineda, Xavier, Serizel, Romain
We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weighted-variance generative model, in which the contribution of each spectrogram time-frame to parameter learning is weighted. We impose a Gamma prior distribution on the weights, which effectively leads to a Student's t-distribution instead of a Gaussian for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted-variance model.
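The Gamma-to-Student's-t marginalization can be checked numerically: with a Gamma(a, rate a) weight scaling the precision of a zero-mean Gaussian, the marginal is a standard Student's t with 2a degrees of freedom (a classical result; the snippet is our illustration, not the paper's code).

```python
import numpy as np
from scipy import stats

a, n = 2.0, 200_000
rng = np.random.default_rng(0)
w = rng.gamma(shape=a, scale=1.0 / a, size=n)   # Gamma(a, rate a), so E[w] = 1
x = rng.normal(scale=1.0 / np.sqrt(w))          # Gaussian with variance 1/w
# empirical tail mass vs. the t(2a) prediction -- the two numbers should match
print(np.mean(np.abs(x) > 3), 2 * stats.t.sf(3, df=2 * a))
```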
Variational Meta Reinforcement Learning for Social Robotics
Ballou, Anand, Alameda-Pineda, Xavier, Reinke, Chris
With the increasing presence of robots in our everyday environments, improving their social skills is of utmost importance. Nonetheless, social robotics still faces many challenges. One bottleneck is that robotic behaviors often need to be adapted, as social norms depend strongly on the environment. For example, a robot should navigate more carefully around patients in a hospital than around workers in an office. In this work, we investigate meta-reinforcement learning (meta-RL) as a potential solution. Here, robot behaviors are learned via reinforcement learning, where a reward function must be chosen so that the robot learns an appropriate behavior for a given environment. We propose a variational meta-RL procedure that quickly adapts a robot's behavior to new reward functions. As a result, given a new environment, different reward functions can be quickly evaluated and an appropriate one selected. The procedure learns a vectorized representation of reward functions and a meta-policy that can be conditioned on such a representation. Given observations from a new reward function, the procedure identifies its representation and conditions the meta-policy on it. While investigating the procedure's capabilities, we found that it suffers from posterior collapse, in which only a subset of the dimensions of the representation encode useful information, resulting in reduced performance. Our second contribution, a radial basis function (RBF) layer, partially mitigates this negative effect. The RBF layer lifts the representation to a higher-dimensional space, which is more easily exploited by the meta-policy. We demonstrate the benefits of the RBF layer and of meta-RL for social robotics on four robotic simulation tasks.
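The lifting can be sketched as follows: the low-dimensional representation z is expanded through Gaussian basis functions before being concatenated with the observation that conditions the meta-policy (all sizes are illustrative, and the fixed centers are our simplification).

```python
import torch
import torch.nn as nn

z_dim, n_centers, sigma = 5, 20, 0.3
centers = torch.linspace(-2.0, 2.0, n_centers)    # fixed, shared across dimensions

def rbf_lift(z):                                  # z: (batch, z_dim)
    d = z.unsqueeze(-1) - centers                 # (batch, z_dim, n_centers)
    return torch.exp(-0.5 * (d / sigma) ** 2).flatten(1)

policy = nn.Sequential(nn.Linear(10 + z_dim * n_centers, 64), nn.ReLU(),
                       nn.Linear(64, 4))          # 4 discrete actions (toy)
obs, z = torch.randn(8, 10), torch.randn(8, z_dim)
logits = policy(torch.cat([obs, rbf_lift(z)], dim=-1))
```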
A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation
Airale, Louis, Vaufreydaz, Dominique, Alameda-Pineda, Xavier
Animating still face images with deep generative models using a speech input signal is an active research topic that has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality, while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short- and long-term correlations between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony, both in the landmark domain and in the image domain.
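A toy version of the multi-scale synchrony loss: the trained syncer models are replaced here by a cosine score on average-pooled features, which is our simplification of the guidance signal.

```python
import torch
import torch.nn.functional as F

def multiscale_sync_loss(audio_feat, motion_feat, n_scales=3):
    """audio_feat, motion_feat: (batch, channels, T), time-aligned features."""
    loss = 0.0
    for s in range(n_scales):
        a = F.avg_pool1d(audio_feat, kernel_size=2 ** s)   # coarser time scale
        m = F.avg_pool1d(motion_feat, kernel_size=2 ** s)
        loss = loss + (1.0 - F.cosine_similarity(a, m, dim=1)).mean()
    return loss / n_scales

loss = multiscale_sync_loss(torch.randn(2, 32, 64), torch.randn(2, 32, 64))
```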
Unsupervised speech enhancement with deep dynamical generative speech and noise models
Lin, Xiaoyu, Leglaive, Simon, Girin, Laurent, Alameda-Pineda, Xavier
This work addresses unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both. This DDGM can be trained in three configurations: noise-agnostic, noise-dependent, and noise adaptation after noise-dependent training.
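The three conditioning choices for the noise model can be sketched as follows; the feed-forward architecture and dimensions are placeholders for the paper's deep dynamical (recurrent) model.

```python
import torch
import torch.nn as nn

class NoiseVarianceModel(nn.Module):
    """Predicts per-frequency noise variances from z, from the noisy frame, or both."""
    def __init__(self, z_dim=16, x_dim=257, mode="both"):
        super().__init__()
        in_dim = {"latent": z_dim, "noisy": x_dim, "both": z_dim + x_dim}[mode]
        self.mode = mode
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.Tanh(),
                                 nn.Linear(128, x_dim))

    def forward(self, z, x_noisy):
        inp = {"latent": z, "noisy": x_noisy,
               "both": torch.cat([z, x_noisy], dim=-1)}[self.mode]
        return self.net(inp).exp()                 # exponential ensures positivity

var = NoiseVarianceModel(mode="both")(torch.randn(4, 16), torch.randn(4, 257))
```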