Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models

Fei, Hao, Wu, Shengqiong, Ji, Wei, Zhang, Hanwang, Chua, Tat-Seng

arXiv.org Artificial Intelligence 

Text-to-video (T2V) synthesis has gained increasing attention in the community, where the recently emerged diffusion models (DMs) have shown promisingly stronger performance than earlier approaches. While existing state-of-the-art DMs are capable of high-resolution video generation, they can suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to modeling intricate temporal dynamics, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which (step 1) extracts from the input text the key actions with a proper time-order arrangement, (step 2) transforms the action schedule into dynamic scene graph (DSG) representations, and (step 3) enriches the scenes in the DSG with sufficient and reasonable details. By leveraging existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action-scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our framework consistently outperforms prior art by significant margins, especially in scenarios with complex actions.

Recently, the theme of AI-Generated Content (AIGC) has witnessed thrilling advancements and remarkable progress, e.g., ChatGPT (Ouyang et al., 2022), DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (SD) (Rombach et al., 2022b). As one of these generative topics, text-to-video synthesis, which generates video content complying with a provided textual description, has received increasing attention in the community. More recently, diffusion models have emerged to provide a new paradigm for T2V. Compared with previous models, DMs offer superior generation quality and the capability to scale to large datasets (Harvey et al., 2022; Höppe et al., 2022), and thus show great potential on this track (Mei & Patel, 2022; Luo et al., 2023; Yu et al., 2023; Ni et al., 2023). While the latest DM-based T2V explorations have devoted much effort to enhancing the quality of individual video frames, i.e., generating high-resolution, realistic-looking frames, much less attention has been paid to modeling the intricate temporal dynamics that this work targets.

Overview figure caption: The dynamic scene manager (Dysen) module operates over the input text prompt and produces the enriched dynamic scene graph (DSG), which is encoded by the recurrent graph Transformer (RGTrm); the resulting fine-grained spatio-temporal scene features are integrated into the video generation (denoising) process.
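The three-step Dysen pipeline described above can be illustrated concretely. The sketch below is a minimal, hypothetical rendering of steps 1-3 as in-context-learning calls to an LLM; the LLM callable, the prompt wording, and the ActionEvent/SceneGraph structures are assumptions made for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical LLM interface: takes a prompt string, returns the model's text reply.
LLM = Callable[[str], str]

@dataclass
class ActionEvent:
    """One key action from the prompt with a coarse time slot (step 1)."""
    agent: str
    action: str
    target: str
    start: int  # first frame index the action spans
    end: int    # last frame index the action spans

@dataclass
class SceneGraph:
    """Per-frame scene graph as (subject, relation, object) triplets (steps 2-3)."""
    triplets: List[Tuple[str, str, str]] = field(default_factory=list)

def extract_action_schedule(prompt: str, llm: LLM) -> List[ActionEvent]:
    """Step 1: ask the LLM (in context) for the key actions and their temporal order."""
    reply = llm(
        "List the key actions in this prompt, one per line, as "
        "'agent | action | target | start_frame | end_frame':\n" + prompt
    )
    events = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 5:
            agent, action, target, start, end = parts
            events.append(ActionEvent(agent, action, target, int(start), int(end)))
    return events

def events_to_dsg(events: List[ActionEvent], num_frames: int) -> List[SceneGraph]:
    """Step 2: unfold the action schedule into one scene graph per frame."""
    dsg = [SceneGraph() for _ in range(num_frames)]
    for ev in events:
        for t in range(max(ev.start, 0), min(ev.end + 1, num_frames)):
            dsg[t].triplets.append((ev.agent, ev.action, ev.target))
    return dsg

def enrich_dsg(dsg: List[SceneGraph], llm: LLM) -> List[SceneGraph]:
    """Step 3: ask the LLM to add plausible detail triplets to each frame's graph."""
    for graph in dsg:
        reply = llm(
            "Add a few plausible detail triplets, one per line, as "
            f"'subject | relation | object', to this scene: {graph.triplets}"
        )
        for line in reply.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:
                graph.triplets.append((parts[0], parts[1], parts[2]))
    return dsg
```

With a concrete backend plugged in for the LLM callable (e.g., a ChatGPT API call), chaining the three functions yields per-frame scene graphs of the kind that a graph encoder such as the paper's RGTrm would turn into spatio-temporal conditioning features for the diffusion backbone.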
