Toward Reliable Human Pose Forecasting with Uncertainty
Saadatnejad, Saeed, Mirmohammadi, Mehrshad, Daghyani, Matin, Saremi, Parham, Benisi, Yashar Zoroofchi, Alimohammadi, Amirhossein, Tehraninasab, Zahra, Mordan, Taylor, Alahi, Alexandre
Recently, there has been an arms race of pose forecasting methods aimed at solving the spatio-temporal task of predicting a sequence of future 3D poses of a person given a sequence of past observed ones. However, the lack of unified benchmarks and limited uncertainty analysis have hindered progress in the field. To address this, we first develop an open-source library for human pose forecasting, featuring multiple models, datasets, and standardized evaluation metrics, with the aim of promoting research and moving toward a unified and fair evaluation. Second, we devise two types of uncertainty in the problem to increase performance and convey better trust: 1) we propose a method for modeling aleatoric uncertainty by using uncertainty priors to inject knowledge about the behavior of uncertainty. This focuses the capacity of the model in the direction of more meaningful supervision while reducing the number of learned parameters and improving stability; 2) we introduce a novel approach for quantifying the epistemic uncertainty of any model.

Figure 1: We propose to model two kinds of uncertainty: 1) aleatoric uncertainty, learned by our model to capture the temporal evolution of uncertainty, which becomes more prominent over time, as depicted by the lighter colors and thicker bones for the right person; 2) epistemic uncertainty to detect out-of-distribution forecast poses coming from unseen scenarios in training, such as for the left person.
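The aleatoric-uncertainty idea can be sketched with a heteroscedastic Gaussian negative log-likelihood whose log-variance is not freely predicted per time step but tied to the forecast horizon by a small prior, so uncertainty grows over time with only two learned scalars. The linear prior, the function names, and all numbers below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def gaussian_nll(pred, target, log_var):
    """Heteroscedastic Gaussian NLL for one coordinate:
    0.5 * exp(-log_var) * (pred - target)^2 + 0.5 * log_var."""
    return 0.5 * math.exp(-log_var) * (pred - target) ** 2 + 0.5 * log_var

def prior_log_var(t, a, b):
    """Uncertainty prior (hypothetical linear form): instead of a free
    log-variance per step, tie it to the horizon t via two scalars (a, b),
    so predicted uncertainty grows monotonically with time."""
    return a + b * t

# Toy 3-step forecast: errors grow with the horizon; the prior lets the
# loss discount distant, inherently noisier steps.
preds   = [0.0, 0.1, 0.3]
targets = [0.0, 0.2, 0.6]
a, b = -2.0, 0.5
loss = sum(gaussian_nll(p, y, prior_log_var(t, a, b))
           for t, (p, y) in enumerate(zip(preds, targets)))
```

Because the variance term penalizes over-confidence, later steps with larger errors contribute less when the prior assigns them higher uncertainty, which is the stabilizing effect the abstract describes.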
A generic diffusion-based approach for 3D human pose prediction in the wild
Saadatnejad, Saeed, Rasekh, Ali, Mofayezi, Mohammadreza, Medghalchi, Yasamin, Rajabzadeh, Sara, Mordan, Taylor, Alahi, Alexandre
Predicting 3D human poses in real-world scenarios, also known as human pose forecasting, is inevitably subject to noisy inputs arising from inaccurate 3D pose estimations and occlusions. To address these challenges, we propose a diffusion-based approach that can predict future poses given noisy observations. We frame the prediction task as a denoising problem, where both observation and prediction are considered as a single sequence containing missing elements (whether in the observation or prediction horizon). All missing elements are treated as noise and denoised with our conditional diffusion model. To better handle long-term forecasting horizons, we present a temporal cascaded diffusion model. We demonstrate the benefits of our approach on four publicly available datasets (Human3.6M, HumanEva-I, AMASS, and 3DPW), outperforming the state-of-the-art. Additionally, we show that our framework is generic enough to improve any 3D pose prediction model, as a pre-processing step to repair its inputs and a post-processing step to refine its outputs. The code is available online: \url{https://github.com/vita-epfl/DePOSit}.
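The unified-sequence framing can be illustrated with a toy sketch: observations and the unknown future form one sequence, missing entries (occluded observations or the whole prediction horizon) start as noise, and an iterative denoiser fills them in conditioned on the visible context. The toy denoiser below simply contracts noise toward the context mean; it stands in for the conditional diffusion model, and all names and dynamics are illustrative assumptions.

```python
import random

def make_unified_sequence(obs, horizon):
    """Concatenate observed poses and the unknown future into one
    sequence; None marks missing elements, whether an occluded
    observation or a future step to predict."""
    return list(obs) + [None] * horizon

def denoise(seq, steps=50, seed=0):
    """Toy stand-in for a conditional diffusion model: missing entries
    are initialized with Gaussian noise and iteratively pulled toward
    the mean of the visible context (illustrative only)."""
    rng = random.Random(seed)
    known = [x for x in seq if x is not None]
    ctx = sum(known) / len(known)
    out = [x if x is not None else rng.gauss(0.0, 1.0) for x in seq]
    mask = [x is None for x in seq]
    for _ in range(steps):
        # Each step shrinks the remaining "noise" on missing entries;
        # known entries are left untouched, acting as the condition.
        out = [ctx + 0.8 * (v - ctx) if m else v
               for v, m in zip(out, mask)]
    return out

# One occluded observation plus a 2-step prediction horizon.
seq = make_unified_sequence([1.0, None, 1.2], horizon=2)
filled = denoise(seq)
```

The same mechanism explains why the framework doubles as pre-processing (repairing occluded observations) and post-processing (refining another model's outputs): both are just missing or noisy entries in the unified sequence.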
Revisiting Multi-Task Learning with ROCK: a Deep Residual Auxiliary Block for Visual Detection
Mordan, Taylor, Thome, Nicolas, Henaff, Gilles, Cord, Matthieu
Multi-Task Learning (MTL) is appealing for deep learning regularization. In this paper, we tackle a specific MTL context denoted as primary MTL, where the ultimate goal is to improve the performance of a given primary task by leveraging several other auxiliary tasks. Our main methodological contribution is to introduce ROCK, a new generic multi-modal fusion block for deep learning tailored to the primary MTL context. The ROCK architecture is based on a residual connection, which makes the forward prediction explicitly impacted by the intermediate auxiliary representations. The auxiliary predictor's architecture is also specifically designed for our primary MTL context, incorporating intensive pooling operators to maximize the complementarity of intermediate representations. Extensive experiments on the NYUv2 dataset (object detection with scene classification, depth prediction, and surface normal estimation as auxiliary tasks) validate the relevance of the approach and its superiority to flat MTL approaches. Our method outperforms state-of-the-art object detection models on the NYUv2 dataset by a large margin, and is also able to handle large-scale heterogeneous inputs (real and synthetic images) with missing annotation modalities.
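The residual fusion idea can be sketched abstractly: each auxiliary task encodes the shared features, the auxiliary representations are fused, and the fused result is added back to the input through a residual connection, so the primary task sees features explicitly shaped by the auxiliary supervision. The toy scalar encoders below are hypothetical stand-ins for the depth, surface-normal, and scene heads; this is a structural sketch, not the paper's actual block.

```python
def rock_block(features, aux_encoders, fuse):
    """Sketch of a residual auxiliary block: auxiliary encodings of the
    shared features are fused and added back via a residual connection."""
    aux_reps = [enc(features) for enc in aux_encoders]
    residual = fuse(aux_reps)
    return [f + r for f, r in zip(features, residual)]

# Hypothetical toy encoders standing in for auxiliary task heads.
depth_enc = lambda f: [0.1 * x for x in f]
scene_enc = lambda f: [0.2 * x for x in f]
fuse_sum  = lambda reps: [sum(vals) for vals in zip(*reps)]

out = rock_block([1.0, 2.0], [depth_enc, scene_enc], fuse_sum)
# Each output equals the input plus the fused auxiliary contributions,
# i.e. 1.3 x the input here (up to float rounding).
```

The residual form is the key design choice: if the auxiliary branches contribute nothing, the block degrades gracefully to an identity-like mapping, which also makes it tolerant to inputs with missing annotation modalities.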