Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment