Controlling for Stereotypes in Multimodal Language Model Evaluation