Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
Izadi, Amirmohammad, Banayeeanzade, Mohammad Ali, Askari, Fatemeh, Rahimiakbar, Ali, Vahedi, Mohammad Mahdi, Hasani, Hosein, Baghshah, Mahdieh Soleymani
–arXiv.org Artificial Intelligence
Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit distance error in scene description by 0.32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistically based reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.
arXiv.org Artificial Intelligence
Nov-11-2025
- Genre:
- Research Report > New Finding (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Problem Solving (0.88)
- Machine Learning > Neural Networks
- Deep Learning (0.92)
- Natural Language
- Chatbot (1.00)
- Large Language Model (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Information Technology > Artificial Intelligence