CounterfactualContrastiveLearningfor Weakly-SupervisedVision-LanguageGrounding

Neural Information Processing Systems 

Vision-language grounding is a fundamental and crucial problem in multi-modal understanding.