Two artists and an advertising company created a deepfake of Facebook founder Mark Zuckerberg saying things he never said, and uploaded it to Instagram. The video, created by artists Bill Posters and Daniel Howe in partnership with advertising company Canny, shows Mark Zuckerberg sitting at a desk, seemingly giving a sinister speech about Facebook's power. The video is framed with broadcast chyrons that say "We're increasing transparency on ads," to make it look like part of a news segment. "Imagine this for a second: One man, with total control of billions of people's stolen data, all their secrets, their lives, their futures," Zuckerberg's likeness says in the video. "I owe it all to Spectre. Spectre showed me that whoever controls the data, controls the future."
University of Washington researchers at the UW Graphics and Image Laboratory have developed new algorithms that turn audio clips into a realistic, lip-synced video, starting with an existing video of that person speaking on a different topic. As detailed in a paper to be presented Aug. 2 at SIGGRAPH 2017, the team successfully generated a highly realistic video of former president Barack Obama talking about terrorism, fatherhood, job creation and other topics, using audio clips of those speeches together with existing weekly video addresses in which he originally spoke about something else. Realistic audio-to-video conversion has practical applications, such as improving video conferencing for meetings (streaming audio over the internet takes up far less bandwidth than video, reducing video glitches) or holding a conversation with a historical figure in virtual reality, said Ira Kemelmacher-Shlizerman, an assistant professor at the UW's Paul G. Allen School of Computer Science & Engineering. The approach improves on previous audio-to-video conversion processes, which involved filming multiple people in a studio saying the same sentences over and over to capture how particular sounds correlate with different mouth shapes, a process that is expensive, tedious and time-consuming. The new machine learning tool may also help overcome the "uncanny valley" problem that has dogged efforts to create realistic video from audio.
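The bandwidth argument above is easy to make concrete. The figures below are my own assumed ballpark bitrates (roughly, compressed speech versus a 720p conferencing stream), not numbers from the researchers, but they show the scale of the gap between sending audio plus a synthesized face and streaming live video:

```python
# Illustrative bitrates only: assumed ballpark values, not measurements.
AUDIO_KBPS = 64       # assumed: compressed speech audio
VIDEO_KBPS = 2000     # assumed: 720p video-conferencing stream

def megabytes_per_minute(kbps: float) -> float:
    """Convert a bitrate in kilobits per second to megabytes per minute."""
    return kbps * 60 / 8 / 1000

audio_mb = megabytes_per_minute(AUDIO_KBPS)   # 0.48 MB/min
video_mb = megabytes_per_minute(VIDEO_KBPS)   # 15.0 MB/min
print(f"audio: {audio_mb:.2f} MB/min, video: {video_mb:.2f} MB/min")
print(f"video uses ~{VIDEO_KBPS / AUDIO_KBPS:.0f}x the bandwidth of audio")
```

Under these assumptions video costs roughly thirty times the bandwidth of audio alone, which is why regenerating the talking head at the receiver from an audio stream is attractive for glitch-free conferencing.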
Researchers have crafted algorithms that can take an audio recording of someone talking and map it onto a video clip of them speaking, creating a new, convincing lip-synched video that carries the replacement sound. In other words, the resulting video carries the injected audio rather than its original soundtrack, and the frames are manipulated so that the speaker's face and mouth movements match the new audio. You can be forgiven for seeing this as a vital stepping stone to creating the ultimate fake news – highly believable forged videos. Imagine taking a clip of someone important speaking at a private event, using this software to dub in a completely new script, voiced by a skilled impersonator or generated by another AI such as Lyrebird, and then distributing the fraudulent footage. Thankfully, the technology is nowhere near that level right now.
We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network -- thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
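The core idea in the abstract above, a network that translates synthetic renderings of a parametric face model into photo-realistic frames via adversarial training, can be sketched as a conditional image-to-image GAN. The toy networks, layer sizes, and random stand-in tensors below are my own simplification for illustration; the paper's actual network uses a much deeper space-time architecture conditioned on a temporal window of renderings:

```python
# Minimal sketch (my own simplification, not the authors' code) of
# rendering-to-video transfer: a generator maps a synthetic face-model
# rendering to a photo-realistic frame, and a discriminator judging
# (rendering, frame) pairs pushes the output toward realism.
import torch
import torch.nn as nn

class RenderToVideoGenerator(nn.Module):
    """Toy image-to-image generator from rendering to video frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, rendering):
        return self.net(rendering)

class FrameDiscriminator(nn.Module):
    """Scores (rendering, frame) pairs patch-wise: real pairs high, fake low."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 1, 4, stride=2, padding=1),
        )
    def forward(self, rendering, frame):
        return self.net(torch.cat([rendering, frame], dim=1))

G, D = RenderToVideoGenerator(), FrameDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

rendering = torch.rand(1, 3, 64, 64)   # stand-in synthetic face-model render
real_frame = torch.rand(1, 3, 64, 64)  # stand-in ground-truth video frame

# one illustrative adversarial step
fake_frame = G(rendering)
pred_real = D(rendering, real_frame)
pred_fake = D(rendering, fake_frame.detach())
d_loss = bce(pred_real, torch.ones_like(pred_real)) + \
         bce(pred_fake, torch.zeros_like(pred_fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

pred_fake = D(rendering, fake_frame)
g_loss = bce(pred_fake, torch.ones_like(pred_fake)) + \
         (fake_frame - real_frame).abs().mean()   # L1 term for fidelity
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because the generator only ever sees renderings of the parametric model, re-animation reduces to rendering the target's model with the source actor's reconstructed pose, expression, gaze, and blink parameters and feeding that through the trained generator.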
Synthesizing Obama: Learning Lip Sync from Audio
Supasorn Suwajanakorn, Steven M. Seitz, Ira Kemelmacher-Shlizerman
SIGGRAPH 2017

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track.

http://grail.cs.washington.edu/projec...
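The core mapping the abstract describes, a recurrent network from per-frame audio features to a compact mouth-shape representation, can be sketched as follows. This is a hedged illustration, not the released code: the feature, hidden, and output dimensions here are placeholders I chose for the example, and the actual system adds details such as a time delay so the network can exploit future audio context before committing to a mouth shape:

```python
# Sketch of an audio-to-mouth-shape recurrent model, in the spirit of the
# paper: a sequence of per-frame audio features (e.g. MFCC-style vectors)
# is mapped to low-dimensional mouth-shape coefficients (e.g. PCA
# coefficients of lip landmarks). All dimensions below are illustrative.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, audio_dim=28, hidden=60, mouth_dim=18):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, mouth_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim), one feature vector per frame
        h, _ = self.rnn(audio_feats)
        return self.head(h)  # (batch, time, mouth_dim) mouth coefficients

model = AudioToMouth()
clip = torch.randn(1, 100, 28)   # 100 frames of stand-in audio features
mouth = model(clip)
print(mouth.shape)               # torch.Size([1, 100, 18])
```

Downstream, each predicted mouth shape drives the synthesis of a photo-realistic mouth texture that is composited, with 3D pose matching, into the target video; the sketch covers only the learned audio-to-shape stage.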