 Spatial Reasoning


SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

Neural Information Processing Systems

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.


Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image Yu Zhao

Neural Information Processing Systems

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling.







Supplementary Material: TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Neural Information Processing Systems

Author ordering is determined by coin flip. For what purpose was the dataset created? Was there a specific task in mind? In order to systematically compare the location encoders' performance and their impact on the [...] Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., [...] Who funded the creation of the dataset? Dr. Gengchen Mai acknowledges the Microsoft Research [...] What do the instances that comprise the dataset represent (e.g., documents, photos, people, [...] The instances in all 17 datasets represent images.



Supplementary Material for "Diversifying Spatial-Temporal Perception for Video Domain Generalization" Kun-Yu Lin

Neural Information Processing Systems

Hard Norm Alignment loss (HNA): apply the HNA loss (Eq. HMDB, which demonstrates the effectiveness of our model. First, we drop feature from a specific spatial group.

Method       UCF HMDB
STDN-T -1    59.2
STDN-T -2    58.1
STDN-T -3    59.4
STDN-T -4    58.9
Full STDN    60.2

Second, we drop feature from a space scale. In our main manuscript, we conduct all experiments based on ResNet-50.