VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang

arXiv.org Artificial Intelligence 

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs) and consists of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive transformer framework. The pretrained LLM serves as a foundation that can be adapted to a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, in particular VideoPoet's ability to generate high-fidelity motion. Project page: http://sites.research.google/videopoet/
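To make the architecture description concrete, below is a minimal sketch (not the authors' implementation) of the core idea: discrete tokens from modality-specific tokenizers are offset into disjoint ranges of one shared vocabulary, concatenated into a single sequence, and trained with next-token prediction in a decoder-only transformer. All vocabulary sizes, hyperparameters, and helper names here are illustrative assumptions; positional embeddings and the real tokenizers are omitted for brevity.

    import torch
    import torch.nn as nn

    # Illustrative vocabulary sizes; the real tokenizers and sizes differ.
    TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 32000, 8192, 4096
    SPECIAL = 4  # assumed boundary/control tokens, e.g. <bos>, <eos>
    VOCAB = SPECIAL + TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB

    def to_shared_ids(ids: torch.Tensor, modality: str) -> torch.Tensor:
        # Offset per-modality token ids into disjoint ranges of one vocabulary.
        offset = {"text": SPECIAL,
                  "video": SPECIAL + TEXT_VOCAB,
                  "audio": SPECIAL + TEXT_VOCAB + VIDEO_VOCAB}[modality]
        return ids + offset

    class DecoderOnlyLM(nn.Module):
        def __init__(self, d_model=512, n_heads=8, n_layers=6):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                               batch_first=True, norm_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, VOCAB)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # A causal mask makes the encoder stack behave as a decoder-only model.
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            h = self.blocks(self.embed(tokens), mask=causal)
            return self.lm_head(h)  # next-token logits over the shared vocabulary

    # One text-to-video training example: prompt tokens followed by video tokens;
    # next-token cross-entropy over the sequence is the generative objective.
    text = to_shared_ids(torch.randint(0, TEXT_VOCAB, (1, 16)), "text")
    video = to_shared_ids(torch.randint(0, VIDEO_VOCAB, (1, 64)), "video")
    seq = torch.cat([text, video], dim=1)
    logits = DecoderOnlyLM()(seq[:, :-1])
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), seq[:, 1:])

In the paper, image and video tokens come from a MAGVIT-v2 tokenizer and audio tokens from a SoundStream tokenizer; the sketch replaces both with random ids purely to exercise the shapes and the training objective.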