SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

May-27-2025, 21:14:08 GMT–Neural Information Processing Systems

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (i) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (ii) a flexible "plugin" module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
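To make "relative directions and distances" between regions concrete, here is a minimal sketch of the kind of relation that can be read off two nodes of a 3D scene graph. It is illustrative only and is not the SpatialRGPT pipeline: the `Region` class, the `spatial_relation` function, the `eps` tolerance, and the camera-frame convention (x to the right, y downward, z away from the camera, units in metres) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Region:
    """Hypothetical scene-graph node: a named region with a 3D centroid."""
    name: str
    center: Tuple[float, float, float]  # (x, y, z) in metres, camera frame (assumed)


def _pick(delta: float, negative: str, positive: str, eps: float) -> str:
    """Map a signed offset to a coarse label, treating small offsets as ties."""
    if abs(delta) < eps:
        return "level with"
    return negative if delta < 0 else positive


def spatial_relation(a: Region, b: Region, eps: float = 0.05) -> Dict[str, object]:
    """Describe where region `a` sits relative to region `b`.

    Assumed convention: x grows to the right, y grows downward,
    z grows away from the camera.
    """
    dx = a.center[0] - b.center[0]
    dy = a.center[1] - b.center[1]
    dz = a.center[2] - b.center[2]
    return {
        "distance_m": math.dist(a.center, b.center),  # Euclidean distance
        "horizontal": _pick(dx, "left of", "right of", eps),
        "vertical": _pick(dy, "above", "below", eps),
        "depth": _pick(dz, "in front of", "behind", eps),
    }


if __name__ == "__main__":
    chair = Region("chair", (-1.0, 0.0, 2.0))
    table = Region("table", (2.0, 0.0, 6.0))
    print(spatial_relation(chair, table))
```

In this toy setup the chair's centroid is offset from the table's by (-3, 0, -4) metres, so the sketch reports it as left of, level with, and in front of the table. A real pipeline would first have to recover such 3D centroids from images, which this example does not attempt.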

artificial intelligence, natural language, spatial reasoning

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.69)
  - Natural Language (0.69)
  - Representation & Reasoning > Spatial Reasoning (0.45)