Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
