HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

arXiv.org Artificial Intelligence 

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only the temporal layers. We advance this approach with an architecture tailored for video generation, incorporating a mapping network and frame-wise tokens, while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We successfully integrate video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies the training process and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth.
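To make the frozen-T2I-plus-trainable-temporal-layers idea concrete, the following is a minimal PyTorch sketch. The module and function names (TemporalAttention, freeze_t2i, temporal_smoothness_loss) and the specific L2 frame-difference loss are illustrative assumptions for exposition, not the paper's actual architecture, mapping network, or loss definitions.

```python
# Minimal sketch: freeze a pretrained T2I backbone, train only added temporal
# layers, and add a temporal-smoothness penalty. All names and loss forms here
# are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis; the only trainable part in this sketch."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial positions, frames, channels)
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        # Residual connection keeps the frozen T2I behavior as the starting point.
        return x + h


def freeze_t2i(unet: nn.Module) -> None:
    """Freeze every pretrained T2I parameter so only new temporal modules train."""
    for p in unet.parameters():
        p.requires_grad_(False)


def temporal_smoothness_loss(pred: torch.Tensor) -> torch.Tensor:
    """Penalize large changes between predictions for adjacent frames.

    pred: (batch, frames, C, H, W). A simple L2 difference penalty is only one
    plausible form of a temporal-smoothness objective.
    """
    return (pred[:, 1:] - pred[:, :-1]).pow(2).mean()
```

In a training loop following this sketch, the total loss would combine the usual diffusion denoising objective with the smoothness term, and only the parameters of the added temporal layers would be passed to the optimizer.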