Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
