Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
