Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training