Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Open in new window