phenaki
Deepfakes could get super advanced (and weird) thanks to these breakthroughs
If you picture a cape-clad dog soaring through the clouds or an astronaut riding a horse on Mars, you may think you're experiencing a fever dream. But these surreal images exist outside of a sleepy daze: You can pull them up on your computer right now. They were created by Meta's top-of-the-line algorithms that can turn any text into a (somewhat) realistic video. Last month, Meta used these surreal clips to introduce its Make-A-Video AI text-to-video generator to the world. Just days later, Google showed off not one but two AI video generators: Imagen Video and Phenaki. These models were designed to transform text descriptions into short video clips.
Phenaki
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new causal model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens.
Meet Phenaki: A Machine Learning-Based Model For Generating Videos From Text Prompts And Uses C-ViViT As Video Encoder
Text-to-image generation is a hot topic in the AI domain, mainly thanks to the open-source release of stable-diffusion. Do you want to see an image of "a teddy bear sleeping in a medieval bed drawn in Van Gogh style"? You can pass a prompt with details, and the stable-diffusion AI will generate a realistic image for you. The X-to-Y generation madness using diffusion models is not just limited to images. You can go from text-to-image, text-to-speech, image-to-image, and the list goes on.
La veille de la cybersรฉcuritรฉ
Artificial intelligence is quickly advancing in the field of video generation. That could have a profound effect on our social media feeds one day. AI's creative abilities are outstripping its driving skills. While self-driving car technology is going nowhere, there's been a remarkable explosion in research around generative models, or artificial intelligence systems that can create images from simple text. In just the past week, AI researchers from Meta Platforms Inc. and Alphabet Inc.'s Google have taken an extraordinary leap forward, developing systems that can generate videos with just about any text prompt one can imagine.
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Villegas, Ruben, Babaeizadeh, Mohammad, Kindermans, Pieter-Jan, Moraldo, Hernan, Zhang, Han, Saffar, Mohammad Taghi, Castro, Santiago, Kunze, Julius, Erhan, Dumitru
We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the perframe baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency. It is now possible to generate realistic high resolution images given a description [34, 35, 32, 38, 59], but generating high quality videos from text remains challenging.