Vision and language pretraining in the absence of caption annotations
