From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping