Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning