Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models