ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense
