Apr-27-2026, 16:38:10 GMT–Neural Information Processing Systems

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations.

artificial intelligence, machine learning, representation

Country:
- North America > United States (0.28)

Industry:
- Health & Medicine > Therapeutic Area (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Machine Learning (1.00)
  - Vision > Image Understanding (0.31)

Title
Learning Viewpoint Agnostic Visual Representations by Recovering
