Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning
