Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Bigioi, Dan, Basak, Shubhajit, Stypułkowski, Michał, Zięba, Maciej, Jordan, Hugh, McDonnell, Rachel, Corcoran, Peter

May-11-2023–arXiv.org Artificial Intelligence

Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate speech recording, the lip and jaw motions are re-synchronised to the new audio without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
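To make the idea of an audio-conditioned denoising diffusion model more concrete, the following is a minimal sketch of the standard DDPM forward-noising step and noise-prediction loss, with a mel-feature window passed alongside the noisy frame. It is not the authors' implementation: the linear beta schedule, the frame and mel shapes, and the `toy_denoiser` placeholder are all assumptions chosen for illustration, and a real model would replace the placeholder with a trained network.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def toy_denoiser(x_t, t, audio_features):
    """Hypothetical stand-in for a learned network eps_theta(x_t, t, audio).

    A real model would use audio_features to predict the added noise;
    this placeholder returns zeros so the sketch stays self-contained.
    """
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

# Toy inputs (shapes are assumptions): one RGB frame scaled to [-1, 1]
# and a window of 80 mel bands over 16 audio time steps.
frame = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
mel_window = rng.standard_normal((80, 16))

t = 500
x_t, noise = add_noise(frame, t, alpha_bar, rng)

# Training objective: mean squared error between true and predicted noise.
predicted_noise = toy_denoiser(x_t, t, mel_window)
loss = np.mean((noise - predicted_noise) ** 2)
```

In this sketch the audio features only enter through the denoiser's arguments; how they are injected into the network (for example by concatenation or feature-wise modulation) is a design choice that the abstract does not specify.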

artificial intelligence, audio-conditioned diffusion model, speech driven video editing

Genre:
- Research Report (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Vision > Face Recognition (0.53)
