Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning
