Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
