Improving multimodal datasets with image captioning