STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Chen, Zhifei, Xu, Tianshuo, Wu, Leyi, Wang, Luozhou, Yan, Dongyu, You, Zihan, Luo, Wenting, Zhang, Guo, Chen, Yingcong

arXiv.org Artificial Intelligence 

Figure 1: Videos generated by STANCE. Controls yield physically meaningful edits while preserving appearance: increasing mass can reverse collision outcomes, larger speeds produce longer travel and earlier contact, and rotating the arrow reorients trajectories and shifts contact points; z disambiguates out-of-plane intent under camera motion. Examples span both simple collision setups and realistic scenes, including gentle pushes that dislodge or trigger collisions.

Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. First, we introduce Instance Cues, a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D drag/arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings.
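To make the Instance Cues construction concrete, here is a minimal NumPy sketch of the sparse-to-dense step: for each instance mask, the optical flow is averaged and the mean is written back to every pixel of that mask, with a depth value added as a third channel. This is an illustration under stated assumptions, not the paper's implementation. The function name `instance_cues`, the array layouts, and the choice to use the mean depth over the mask as the third channel are all hypothetical; the abstract does not specify how depth is attached.

```python
import numpy as np

def instance_cues(flow, depth, masks):
    """Build a dense (H, W, 3) cue map from per-instance averages.

    flow:  (H, W, 2) optical flow (dx, dy) per pixel.
    depth: (H, W) monocular depth estimate.
    masks: iterable of (H, W) boolean instance masks.
    Assumption: channel 2 holds the mean depth over the mask.
    """
    h, w, _ = flow.shape
    cue = np.zeros((h, w, 3), dtype=np.float32)
    for mask in masks:
        if not mask.any():
            continue  # skip empty masks to avoid averaging over nothing
        # One mean (dx, dy) per instance, broadcast to all of its pixels.
        cue[mask, :2] = flow[mask].mean(axis=0)
        # Hypothetical depth channel: mean depth over the same mask.
        cue[mask, 2] = depth[mask].mean()
    return cue

# Toy example: one 2x2 instance in the top-left corner of a 4x4 frame.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[:2, :2, 0] = [[1.0, 3.0], [1.0, 3.0]]  # dx values, mean 2.0
flow[:2, :2, 1] = [[0.0, 0.0], [2.0, 2.0]]  # dy values, mean 1.0
depth = np.full((4, 4), 5.0, dtype=np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True

cue = instance_cues(flow, depth, [mask])
```

Pixels outside every mask stay at zero, so the cue map remains pixel-aligned with the first frame while each instance carries a single, uniform motion vector.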
