Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
