Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
