Neural Information Processing Systems 

Recent advances in foundation video generators (33; 38; 28), particularly large diffusion transformers, have significantly improved video generation capabilities. This progress naturally suggests leveraging these foundation models to advance video inpainting and editing. However, adapting their conditional generation abilities to these tasks through training typically demands substantial computational resources, given the models' massive scale. Furthermore, as foundation models continue to evolve, approaches that rely on extensive fine-tuning will find it increasingly difficult to adapt to new video generators. An alternative is to employ these video generators as data priors, which allows the tasks to be solved in a training-free manner. Recent research has extensively explored methods for enabling conditional generation in image diffusion models.