Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Neural Information Processing Systems

Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents.