Punching Bag vs. Punching Person: Motion Transferability in Videos

Abdullah, Raiyaan, Claypoole, Jared, Cogswell, Michael, Divakaran, Ajay, Rawat, Yogesh

arXiv.org Artificial Intelligence 

Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: (1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; (2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; (3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition.