MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching

Yen-Siang Wu, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

arXiv.org Artificial Intelligence 

To control the pacing and flow of AI-generated videos, users need control over the dynamics and composition of the videos produced by generative models. To this end, numerous motion control methods [25, 33, 57, 59, 61, 63, 72] have been proposed to steer the trajectories of moving objects in videos generated by text-to-video (T2V) diffusion models [4, 17]. Motion customization, in particular, aims to control T2V diffusion models with the motion of a reference video [26, 31, 36, 71, 76]. With the assistance of the reference video, users can specify the desired object movements and camera framing in detail. Formally, given a reference video, motion customization adjusts a pre-trained T2V diffusion model so that videos sampled from the adjusted model follow the object movements and camera framing of the reference video (see Figure 1 for an example). Because motion is a high-level concept spanning both spatial and temporal dimensions [65, 71], motion customization is considered a non-trivial task.

Recently, many motion customization methods have been proposed to eliminate the influence of the reference video's visual appearance. Among them, a standout strategy is fine-tuning the pre-trained T2V diffusion model to reconstruct the frame differences of the reference video, so that the model is supervised on temporal changes rather than per-frame appearance.
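To make this frame-difference fine-tuning strategy concrete, the sketch below shows one possible training loop in a PyTorch-style latent diffusion setup. It is only an illustration, not the authors' implementation: `ToyTemporalUNet`, the toy latent shapes, and the simplified noising step are hypothetical stand-ins for a real T2V denoiser, VAE-encoded video latents, and a diffusion noise scheduler. The key idea it demonstrates is that the reconstruction loss is computed on differences between consecutive frames rather than on the frames themselves.

```python
# Minimal sketch (not the authors' code) of fine-tuning a T2V diffusion model
# to reconstruct the frame differences of a reference video.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTemporalUNet(nn.Module):
    """Hypothetical stand-in for a pre-trained T2V denoiser that predicts noise."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_latents, timesteps, text_emb):
        # A real model would condition on timesteps and text embeddings;
        # both are ignored here to keep the sketch self-contained.
        return self.net(noisy_latents)


def frame_differences(latents):
    # latents: (B, C, T, H, W); return differences between consecutive frames.
    return latents[:, :, 1:] - latents[:, :, :-1]


# Toy reference-video latents and text embedding (in practice these would come
# from a video VAE encoder and a text encoder such as CLIP).
B, C, T, H, W = 1, 4, 8, 32, 32
ref_latents = torch.randn(B, C, T, H, W)
text_emb = torch.randn(B, 77, 768)

model = ToyTemporalUNet(C)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(100):
    timesteps = torch.randint(0, 1000, (B,))
    noise = torch.randn_like(ref_latents)
    # A real diffusion scheduler would mix noise according to `timesteps`;
    # adding it directly keeps this sketch simple.
    noisy_latents = ref_latents + noise

    pred_noise = model(noisy_latents, timesteps, text_emb)
    pred_clean = noisy_latents - pred_noise

    # Supervise only the frame differences of the predicted video, so the
    # model is pushed to match the reference motion rather than reproduce
    # the appearance of individual frames.
    loss = F.mse_loss(frame_differences(pred_clean),
                      frame_differences(ref_latents))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this formulation, any per-frame appearance detail that is constant across frames cancels out of the difference targets, which is why reconstructing frame differences is used as a proxy for learning motion while reducing appearance leakage from the reference video.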