Towards Understanding Visual Grounding in Visual Language Models
