FrameBridge: Improving Image-to-Video Generation with Bridge Models

Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen

arXiv.org Artificial Intelligence 

Image-to-video (I2V) generation is gaining increasing attention due to its wide applications in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress thanks to novel designs in network architecture, cascaded frameworks, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably struggle to generate video samples with both appearance consistency and temporal coherence from uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, which takes the given static image as a prior for the video target and establishes a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data generative process, we fully exploit the information in the input image and make it easier for the generative model to learn the image animation process. For the two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model and training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the efficiency of fine-tuning diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) FrameBridge achieves superior I2V quality compared with its diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) the proposed SAF and neural prior effectively enhance bridge-based I2V models in the fine-tuning and training-from-scratch scenarios, respectively.

However, although these methods have demonstrated the potential of diffusion models (Ho et al., 2020; Song et al., 2020) in I2V synthesis, they are restricted by their noise-to-data generation process and inevitably struggle to generate video samples with both appearance consistency and temporal coherence from uninformative random noise. Because the noise-to-data sampling trajectory is inherently mismatched with the frames-to-frames synthesis process of the I2V task, previous diffusion-based methods place an additional burden on the generative model, which may result in limited synthesis quality.

Figure: The sampling process of FrameBridge (upper) starts from the informative given image, while diffusion models (lower) synthesize videos from an uninformative noisy representation.

In this work, we present FrameBridge, a novel I2V framework that models the frames-to-frames synthesis process with the recently proposed data-to-data generative frameworks (Chen et al.; Liu et al., 2023; Chen et al., 2023c).
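As a rough illustration of the data-to-data idea (the prior construction and noise schedule below are generic assumptions for exposition, not necessarily the exact ones used by FrameBridge), a Brownian-bridge process can be pinned at the video target $x_0$ and at an image-derived prior $y$, e.g., the input image replicated along the frame axis:

$$ q(x_t \mid x_0, y) = \mathcal{N}\Big(x_t;\ \tfrac{t}{T}\, y + \big(1 - \tfrac{t}{T}\big) x_0,\ \epsilon^2\, \tfrac{t\,(T - t)}{T}\, I\Big), \qquad x_T = y, $$

in contrast to a standard diffusion forward process $q(x_t \mid x_0) = \mathcal{N}(x_t;\ \alpha_t x_0,\ \sigma_t^2 I)$, whose endpoint $x_T \approx \mathcal{N}(0, I)$ retains no information about the input image. Under such a bridge formulation, sampling starts from the informative prior $y$ rather than from pure noise, which is the intuition behind the frames-to-frames view of I2V described above.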