Moving Off-the-Grid: Scene-Grounded Video Representations

Neural Information Processing Systems 

Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at
