Measuring Progress in Fine-grained Vision-and-Language Understanding