CLIP Model for Images to Textual Prompts Based on Top-k Neighbors

Xin Zhang, Xin Zhang, YeMing Cai, Tianzhi Jia

arXiv.org Artificial Intelligence 

Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method allows for direct utilization of the generated prompts or serves as valuable initialization for data-efficient fine-tuning processes. We divide our method into two stages, an offline stage and an online stage, using a combination of the CLIP model and the K-nearest neighbors (KNN) algorithm. This approach significantly reduces data costs and time consumption while achieving high quality and diversity in the generation of prompts related to input images.
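The offline/online retrieval idea described in the abstract can be sketched as follows. This is a minimal illustration only: the prompt bank, the variable names, and the toy three-dimensional vectors below are all assumptions standing in for real CLIP embeddings, which in the actual system would come from CLIP's text encoder (offline, for the prompt bank) and image encoder (online, for the query image).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def topk_prompts(image_emb, prompt_bank, k=2):
    """Rank the prompt bank by cosine similarity to the query embedding
    and return the top-k (prompt, score) pairs."""
    scored = [(p, cosine(image_emb, e)) for p, e in prompt_bank]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Offline stage (assumption): embed a bank of candidate prompts once and
# store the vectors. These are toy stand-ins for CLIP text embeddings.
prompt_bank = [
    ("a photo of a cat", [1.0, 0.1, 0.0]),
    ("a watercolor landscape", [0.0, 1.0, 0.2]),
    ("a city at night", [0.1, 0.0, 1.0]),
]

# Online stage (assumption): embed the incoming image, then retrieve its
# top-k neighbor prompts from the precomputed bank.
image_emb = [0.9, 0.2, 0.05]
print(topk_prompts(image_emb, prompt_bank, k=2))
```

Because cosine similarity on unit-normalized CLIP embeddings is a standard retrieval metric, the online stage reduces to a single nearest-neighbor lookup, which is what keeps the approach cheap relative to training a captioning model.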