Vision-Language Models Struggle to Align Entities across Modalities
