SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models
Zhang, Wenyu, Ng, Wei En, Ma, Lixin, Wang, Yuwen, Zhao, Jungqi, Li, Boyang, Wang, Lu
arXiv.org Artificial Intelligence
Current vision-language models may incorporate single-dimensional spatial cues, such as depth, object boundaries, and basic spatial directions (e.g., left, right, front, back), yet often lack the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework with a new human-annotated dataset to pinpoint model strengths and weaknesses, advancing from single-skill tasks to multi-skill tasks, and ultimately to complex reasoning tasks that require the integration of multiple spatial and visual cues with logical reasoning. Benchmark evaluation of state-of-the-art open-source models reveals significant shortcomings, especially in the abilities to understand distance and proximity, to reason from both allocentric and egocentric viewpoints, and to perform complex reasoning in a physical context. This work underscores the need for more advanced approaches to spatial understanding and reasoning, paving the way for improvements in vision-language models and their alignment with human-like spatial capabilities. The dataset will be open-sourced upon publication.
Dec-17-2024
- Genre:
  - Research Report (0.64)
- Technology:
  - Information Technology > Artificial Intelligence
    - Natural Language (1.00)
    - Vision (1.00)