Collaborating Authors

 Drobyshev, Nikita


KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

arXiv.org Artificial Intelligence

Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.
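
The two-stage structure described in this abstract (sparse keyframes first, then in-between frames) can be illustrated with a minimal sketch. This is not the KeyFace implementation: the function names `keyframe_indices` and `interpolate` are hypothetical, frames are stood in for by plain lists of floats, and simple linear blending is swapped in for the paper's learned diffusion-based interpolation model purely to show the data flow.

```python
from typing import List, Sequence


def keyframe_indices(num_frames: int, stride: int) -> List[int]:
    """Pick every `stride`-th frame as a keyframe, always including the last frame."""
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def interpolate(keyframes: Sequence[Sequence[float]],
                indices: Sequence[int]) -> List[List[float]]:
    """Fill the gaps between consecutive keyframes.

    Assumption: linear blending is only a placeholder for a learned
    interpolation model; it keeps the keyframes themselves unchanged.
    """
    if len(keyframes) != len(indices) or not keyframes:
        raise ValueError("need one index per keyframe and at least one keyframe")
    frames: List[List[float]] = []
    for pos in range(len(indices) - 1):
        start, end = indices[pos], indices[pos + 1]
        first, second = keyframes[pos], keyframes[pos + 1]
        span = end - start
        for step in range(span):
            t = step / span  # 0 at the left keyframe, approaching 1 at the right
            frames.append([(1 - t) * a + t * b for a, b in zip(first, second)])
    frames.append(list(keyframes[-1]))
    return frames


if __name__ == "__main__":
    idx = keyframe_indices(num_frames=10, stride=4)
    keys = [[float(i)] for i in idx]  # toy one-dimensional "frames"
    print(idx)
    print(interpolate(keys, idx))
```

In this sketch the keyframe stride plays the role of the low frame rate of the first stage; a larger stride means fewer keyframes and longer gaps for the second stage to fill.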


Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models

arXiv.org Artificial Intelligence

Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, the field remains underexplored regarding non-verbal communication despite evidence demonstrating its importance in human interaction. In particular, generating laughter sequences presents a unique challenge due to the intricacy and nuances of this behaviour. This paper aims to bridge this gap by proposing a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter. We highlight the failure cases of traditional facial animation methods and leverage recent advances in diffusion models to produce convincing laughter videos. We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter. When compared with previous speech-driven approaches, our model achieves state-of-the-art performance across all metrics, even when those approaches are re-trained for laughter generation. Our code and project are publicly available.


Interpretation of 3D CNNs for Brain MRI Data Classification

arXiv.org Machine Learning

Deep learning shows high potential for many medical image analysis tasks. Neural networks can work with full-size data without extensive preprocessing and feature generation and, thus, without the associated information loss. Recent work has shown that morphological differences in specific brain regions can be found on MRI by means of Convolutional Neural Networks (CNNs). However, interpretation of the existing models is based on a region of interest and cannot be extended to voxel-wise interpretation of a whole image. In the current work, we consider a classification task on a large-scale open-source dataset of young healthy subjects: an exploration of brain differences between men and women. We extend previous findings on gender differences from diffusion-tensor imaging to T1 brain MRI scans. We provide a voxel-wise 3D CNN interpretation, comparing the results of three interpretation methods (Meaningful Perturbations, Grad-CAM, and Guided Backpropagation), and contribute an open-source library.
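
Of the three interpretation methods named above, Grad-CAM has the simplest closed form: each feature map is weighted by the mean of its gradient with respect to the class score, the weighted maps are summed, and negative values are clipped. The sketch below shows that core computation only. It is not the authors' library: the function name `grad_cam` is hypothetical, each 3D feature map is assumed to be flattened into a plain list of voxel values, and the upsampling of the map back to input resolution is omitted.

```python
from typing import List, Sequence


def grad_cam(activations: Sequence[Sequence[float]],
             gradients: Sequence[Sequence[float]]) -> List[float]:
    """Compute a Grad-CAM map from one convolutional layer.

    activations[k][v]: value of feature map k at (flattened) voxel v.
    gradients[k][v]:   gradient of the class score w.r.t. that activation.
    Returns one non-negative relevance value per voxel of the feature map.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("activations and gradients must have the same channels")
    num_voxels = len(activations[0])
    # Channel weight = gradient averaged over all voxels of that channel.
    weights = [sum(grad) / len(grad) for grad in gradients]
    cam = []
    for v in range(num_voxels):
        value = sum(w * act[v] for w, act in zip(weights, activations))
        cam.append(max(0.0, value))  # ReLU keeps only positive evidence
    return cam


if __name__ == "__main__":
    acts = [[1.0, 2.0], [3.0, 0.0]]     # two channels, two voxels (toy values)
    grads = [[1.0, 1.0], [-1.0, -1.0]]  # toy gradients
    print(grad_cam(acts, grads))
```

Because the map lives at the resolution of the chosen layer, a voxel-wise comparison with the other two methods would additionally require resampling it to the size of the input scan.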