Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
arXiv.org Artificial Intelligence
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% of the time over prior work.
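The two-step factorization described in the abstract can be illustrated with a short sketch. The callables `text_to_image` and `image_and_text_to_video` below are hypothetical placeholders for the paper's text-to-image stage and image-conditioned video stage; they are not Emu Video's released API, and the adjusted noise schedules and multi-stage training the paper identifies as critical are omitted here.

```python
# A minimal sketch of factorized text-to-video generation, assuming two
# hypothetical diffusion-model callables (not Emu Video's actual code):
#   text_to_image(prompt) -> image
#   image_and_text_to_video(prompt, image, num_frames) -> list of frames

from typing import Any, Callable, List

Image = Any  # placeholder type for a generated frame


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_and_text_to_video: Callable[[str, Image, int], List[Image]],
    num_frames: int = 16,
) -> List[Image]:
    """Generate a video in the two explicit steps described in the abstract."""
    # Step 1: sample a single image conditioned on the text prompt.
    first_frame = text_to_image(prompt)

    # Step 2: sample the full video conditioned on the prompt *and* the image.
    # The image fixes appearance, so this stage mainly has to model motion.
    frames = image_and_text_to_video(prompt, first_frame, num_frames)
    return frames
```

The same interface also covers the image-animation use case mentioned at the end of the abstract: a user-supplied image can be passed directly to the second stage in place of a generated one.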
Nov-17-2023
- Country:
- Europe
- France (0.14)
- Spain (0.14)
- United Kingdom > Northern Ireland (0.14)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.70)
- Natural Language (1.00)
- Vision (1.00)