Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
