Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation