Vision-Language Models Struggle to Align Entities across Modalities