Vision and language pretraining in the absence of caption annotations