Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
