CREPE: Can Vision-Language Foundation Models Reason Compositionally?