CREPE: Can Vision-Language Foundation Models Reason Compositionally?
