Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
arXiv.org Artificial Intelligence
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% of the time over prior work.
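The two-step factorization described in the abstract can be illustrated with a short sketch. The callables `text_to_image` and `image_and_text_to_video` below are hypothetical placeholders for the paper's text-to-image stage and image-conditioned video stage; they are not Emu Video's released API, and the adjusted noise schedules and multi-stage training the paper identifies as critical are omitted here.

```python
# A minimal sketch of factorized text-to-video generation, assuming two
# hypothetical diffusion-model callables (not Emu Video's actual code):
#   text_to_image(prompt) -> image
#   image_and_text_to_video(prompt, image, num_frames) -> list of frames

from typing import Any, Callable, List

Image = Any  # placeholder type for a generated frame


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_and_text_to_video: Callable[[str, Image, int], List[Image]],
    num_frames: int = 16,
) -> List[Image]:
    """Generate a video in the two explicit steps described in the abstract."""
    # Step 1: sample a single image conditioned on the text prompt.
    first_frame = text_to_image(prompt)

    # Step 2: sample the full video conditioned on the prompt *and* the image.
    # The image fixes appearance, so this stage mainly has to model motion.
    frames = image_and_text_to_video(prompt, first_frame, num_frames)
    return frames
```

The same interface also covers the image-animation use case mentioned at the end of the abstract: a user-supplied image can be passed directly to the second stage in place of a generated one.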
Nov-17-2023
- Country:
- Europe
- France (0.14)
- Spain (0.14)
- United Kingdom > Northern Ireland (0.14)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.70)
- Natural Language (1.00)
- Vision (1.00)