Audio-Driven Emotional 3D Talking-Head Generation
Audio-driven video portrait synthesis is a crucial technology for virtual human interaction and film-making applications. Recent advancements have focused on improving image fidelity and lip synchronization. However, generating accurate emotional expressions, an important aspect of realistic talking-head generation, has remained underexplored in previous works. In this paper, we present a novel system for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Specifically, we use a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks. These landmarks are concatenated with emotional embeddings to produce emotional landmarks through our motion-to-emotion module. The emotional landmarks are then used to render realistic emotional talking-head videos with a Neural Radiance Fields (NeRF)-based emotion-to-video module. Additionally, we propose a pose-sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs. Extensive experiments demonstrate that our method achieves more accurate emotion generation with higher fidelity.
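The abstract outlines a three-stage pipeline (audio-to-motion VAE, motion-to-emotion conditioning, NeRF-based rendering). The sketch below illustrates how the first two stages could be wired together; all module names, layer sizes, and interfaces are assumptions for illustration only, since the listing does not specify the paper's actual architectures.

```python
# Hypothetical sketch of the audio-to-motion and motion-to-emotion stages.
# Dimensions and module designs are assumed, not taken from the paper.
import torch
import torch.nn as nn

class AudioToMotionVAE(nn.Module):
    """VAE-style audio-to-motion module: per-frame audio features -> facial landmarks."""
    def __init__(self, audio_dim=80, latent_dim=64, landmark_dim=68 * 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, landmark_dim))

    def forward(self, audio_feat):
        h = self.encoder(audio_feat)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample latent code, then decode to landmarks.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

class MotionToEmotion(nn.Module):
    """Concatenates predicted landmarks with an emotion embedding to yield emotional landmarks."""
    def __init__(self, landmark_dim=68 * 3, emo_dim=32, num_emotions=8):
        super().__init__()
        self.emotion_table = nn.Embedding(num_emotions, emo_dim)
        self.mlp = nn.Sequential(nn.Linear(landmark_dim + emo_dim, 256), nn.ReLU(),
                                 nn.Linear(256, landmark_dim))

    def forward(self, landmarks, emotion_id):
        emo = self.emotion_table(emotion_id)
        return self.mlp(torch.cat([landmarks, emo], dim=-1))

# Usage: audio features drive landmark prediction, then emotion conditioning.
audio_feat = torch.randn(4, 80)                  # batch of 4 frames of audio features
a2m, m2e = AudioToMotionVAE(), MotionToEmotion()
landmarks, mu, logvar = a2m(audio_feat)
emotional_landmarks = m2e(landmarks, torch.tensor([3, 3, 3, 3]))  # hypothetical emotion id
# A NeRF-based emotion-to-video module (not sketched here) would condition the
# radiance field on these emotional landmarks to render the final video frames.
```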
arXiv.org Artificial Intelligence
Oct-7-2024
- Genre:
  - Research Report (0.50)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (1.00)
    - Natural Language > Chatbot (0.96)
    - Vision (1.00)