Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning