CLIP Model for Images to Textual Prompts Based on Top-k Neighbors

Xin Zhang, Xin Zhang, YeMing Cai, Tianzhi Jia

arXiv.org Artificial Intelligence 

Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method allows for direct utilization of the generated prompts or serves as valuable initialization for data-efficient fine-tuning processes. We divide our method into two stages, an offline stage and an online stage, using a combination of the CLIP model and the K-nearest neighbors (KNN) algorithm. This approach significantly reduces data costs and time consumption while achieving high quality and diversity in the generation of prompts related to input images.
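The offline/online retrieval idea described in the abstract can be sketched as follows. This is a minimal illustration only: the prompt bank, the variable names, and the toy three-dimensional vectors below are all assumptions standing in for real CLIP embeddings, which in the actual system would come from CLIP's text encoder (offline, for the prompt bank) and image encoder (online, for the query image).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def topk_prompts(image_emb, prompt_bank, k=2):
    """Rank the prompt bank by cosine similarity to the query embedding
    and return the top-k (prompt, score) pairs."""
    scored = [(p, cosine(image_emb, e)) for p, e in prompt_bank]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Offline stage (assumption): embed a bank of candidate prompts once and
# store the vectors. These are toy stand-ins for CLIP text embeddings.
prompt_bank = [
    ("a photo of a cat", [1.0, 0.1, 0.0]),
    ("a watercolor landscape", [0.0, 1.0, 0.2]),
    ("a city at night", [0.1, 0.0, 1.0]),
]

# Online stage (assumption): embed the incoming image, then retrieve its
# top-k neighbor prompts from the precomputed bank.
image_emb = [0.9, 0.2, 0.05]
print(topk_prompts(image_emb, prompt_bank, k=2))
```

Because cosine similarity on unit-normalized CLIP embeddings is a standard retrieval metric, the online stage reduces to a single nearest-neighbor lookup, which is what keeps the approach cheap relative to training a captioning model.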