LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu

arXiv.org Artificial Intelligence

Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models on multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or suffer severe performance degradation when object trajectories intersect, primarily due to semantic conflicts in the colliding regions. To address these limitations, we introduce LayerT2V, the first approach that generates video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showing improvements of 1.4 and 4.5 in mIoU and AP50, respectively, over state-of-the-art (SOTA) methods.
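The abstract does not specify how the layers are combined; as a minimal illustration of the general idea of layer-by-layer compositing (not the paper's actual pipeline), each foreground layer can be blended over the running result with a per-pixel alpha mask using the standard "over" operator. The function and array shapes below are assumptions for illustration only.

```python
import numpy as np

def composite_layers(layers):
    """Alpha-composite a list of (rgb, alpha) layers, back to front.

    layers: list of ((H, W, 3) float RGB in [0, 1], (H, W, 1) alpha mask)
    pairs; the first entry is the background (its alpha is ignored).
    Hypothetical helper, not part of LayerT2V.
    """
    rgb, _ = layers[0]
    out = rgb.astype(np.float64)
    for fg, alpha in layers[1:]:
        # standard "over" operator: fg where alpha=1, previous result where alpha=0
        out = alpha * fg + (1.0 - alpha) * out
    return out

# Toy frame: a white 2x2 object layer placed over a black background
bg = np.zeros((4, 4, 3))
fg = np.ones((4, 4, 3))
mask = np.zeros((4, 4, 1))
mask[1:3, 1:3] = 1.0  # the object occupies rows/cols 1..2
frame = composite_layers([(bg, None), (fg, mask)])
```

Because each object lives on its own layer, intersecting trajectories only change which layer is on top at a given pixel, rather than mixing the objects' semantics in a shared region.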