Text encoders bottleneck compositionality in contrastive vision-language models

Open in new window