VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

arXiv.org Artificial Intelligence 

Deep learning has achieved extraordinary performance over a wide range of vision tasks, such as image classification [1], object detection [2] and so on [3]. Although these methods are expert at specific scenarios, they lack the ability to learn general vision representations and are hard to transfer to open-set applications. Considering the wide coverage of languages, CLIP [4] proposes to pre-train visual representations contrastively with natural language signals.

Inspired by prompt tuning in natural language processing [5], Context Optimization (CoOp) [6] freezes CLIP's pre-trained weights and adopts learnable prompts to learn the best-fitted textual inputs in place of the original hand-crafted ones. From another perspective, CLIP-Adapter [?] appends a lightweight adapter module [7] over the image or text encoders and is likewise fine-tuned with CLIP's weights frozen. However, they are constrained by the following
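As a rough illustration of the two fine-tuning styles mentioned above, the PyTorch sketch below shows a CoOp-like learnable prompt and a CLIP-Adapter-like bottleneck adapter. The module names, dimensions, class count, and residual ratio are illustrative assumptions, not values taken from this paper or from the original implementations.

    import torch
    import torch.nn as nn

    class LearnablePrompt(nn.Module):
        """CoOp-style learnable context vectors prepended to class-name token
        embeddings; the CLIP text encoder itself stays frozen. All sizes here
        are assumptions for illustration."""
        def __init__(self, n_ctx=16, ctx_dim=512, n_classes=100):
            super().__init__()
            # Learnable context tokens shared across all classes.
            self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
            # Placeholder per-class name embeddings (in practice these would
            # come from CLIP's tokenizer and token-embedding layer).
            self.register_buffer("cls_emb", torch.randn(n_classes, 1, ctx_dim))

        def forward(self):
            # Concatenate shared context with each class embedding:
            # result is (n_classes, n_ctx + 1, ctx_dim), ready to be fed to
            # the frozen text encoder.
            ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
            return torch.cat([ctx, self.cls_emb], dim=1)

    class Adapter(nn.Module):
        """CLIP-Adapter-style bottleneck MLP applied to a frozen CLIP feature,
        blended with the original feature through a residual ratio."""
        def __init__(self, dim=512, reduction=4, ratio=0.2):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
            )
            self.ratio = ratio

        def forward(self, feat):
            # Residual blend keeps most of the pre-trained CLIP feature.
            return self.ratio * self.fc(feat) + (1 - self.ratio) * feat

In both cases only the small added parameters (context tokens or adapter weights) are trained; classification logits would then be cosine similarities between the (adapted) image feature and each encoded prompt, as in CLIP.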
