Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

Dec-24-2025, 23:22:07 GMT–Neural Information Processing Systems

Recent action recognition models have achieved impressive results by integrating objects, their locations, and their interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, which makes these methods expensive to train and less scalable. This raises a question: if a small set of annotated images is available, either within or outside the domain of interest, how can we leverage them for a downstream video task? We propose StructureViT (SViT for short), a learning framework that demonstrates how the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights.

Technology:
- Information Technology > Artificial Intelligence > Vision (0.39)