Moving Off-the-Grid: Scene-Grounded Video Representations

Mar-22-2026, 17:34:10 GMT - Neural Information Processing Systems

Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" so that they can represent scene elements consistently, even as those elements move across the image plane through time.
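
The contrast between "on-the-grid" and "off-the-grid" tokens can be illustrated with a toy sketch. This is not the MooG architecture, which the abstract does not specify; the grid size, patch size, trajectory, and all names below are hypothetical and chosen only to show how a fixed grid hands a moving scene element from one token to another, whereas an idealized off-the-grid token stays bound to it.

```python
# Toy illustration only: NOT the MooG model. All sizes and names are assumptions.
GRID = 4    # hypothetical 4x4 grid of patch tokens per frame
PATCH = 2   # each patch covers 2x2 pixels, so frames are 8x8 pixels


def on_grid_token_index(x, y):
    """Return the index of the patch token whose fixed cell contains pixel (x, y)."""
    return (y // PATCH) * GRID + (x // PATCH)


# One scene element moving diagonally across three frames (pixel coordinates).
trajectory = [(1, 1), (3, 3), (6, 6)]

# On-the-grid: the element is encoded by whichever token owns its current cell,
# so the responsible token index changes from frame to frame.
on_grid = [on_grid_token_index(x, y) for x, y in trajectory]

# Off-the-grid (idealized): a single token stays bound to the element, and only
# the image position it is associated with changes over time.
OFF_GRID_TOKEN = 0
off_grid = [(OFF_GRID_TOKEN, position) for position in trajectory]

print("on-grid token per frame: ", on_grid)
print("off-grid token per frame:", [token for token, _ in off_grid])
```

In the on-grid case, any downstream consumer has to re-discover which token holds the element in each frame; in the idealized off-grid case, the same token index can be read throughout the clip.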

Technology:
- Information Technology > Artificial Intelligence > Vision (0.39)