VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

arXiv.org Artificial Intelligence 

Deep learning has achieved extraordinary performance over a wide range of vision tasks, such as image classification [1], object detection [2] and so on [3]. Although these methods are expert at specific scenarios, they lack the ability to learn general vision representations and are hard to transfer to open-set applications. Considering the wide coverage of languages, CLIP [4] proposes to pre-train visual representations contrastively with natural language signals.

Inspired by prompt tuning in natural language processing [5], Context Optimization (CoOp) [6] freezes CLIP's pre-trained weights and adopts learnable prompts to learn the best-fitted textual inputs in place of the original hand-crafted ones. From another perspective, CLIP-Adapter [?] appends a lightweight adapter module [7] over the image or text encoders and is likewise fine-tuned with CLIP's weights frozen. However, they are constrained by the following
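As a rough illustration of the two fine-tuning styles mentioned above, the PyTorch sketch below shows a CoOp-like learnable prompt and a CLIP-Adapter-like bottleneck adapter. The module names, dimensions, class count, and residual ratio are illustrative assumptions, not values taken from this paper or from the original implementations.

    import torch
    import torch.nn as nn

    class LearnablePrompt(nn.Module):
        """CoOp-style learnable context vectors prepended to class-name token
        embeddings; the CLIP text encoder itself stays frozen. All sizes here
        are assumptions for illustration."""
        def __init__(self, n_ctx=16, ctx_dim=512, n_classes=100):
            super().__init__()
            # Learnable context tokens shared across all classes.
            self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
            # Placeholder per-class name embeddings (in practice these would
            # come from CLIP's tokenizer and token-embedding layer).
            self.register_buffer("cls_emb", torch.randn(n_classes, 1, ctx_dim))

        def forward(self):
            # Concatenate shared context with each class embedding:
            # result is (n_classes, n_ctx + 1, ctx_dim), ready to be fed to
            # the frozen text encoder.
            ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
            return torch.cat([ctx, self.cls_emb], dim=1)

    class Adapter(nn.Module):
        """CLIP-Adapter-style bottleneck MLP applied to a frozen CLIP feature,
        blended with the original feature through a residual ratio."""
        def __init__(self, dim=512, reduction=4, ratio=0.2):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
            )
            self.ratio = ratio

        def forward(self, feat):
            # Residual blend keeps most of the pre-trained CLIP feature.
            return self.ratio * self.fc(feat) + (1 - self.ratio) * feat

In both cases only the small added parameters (context tokens or adapter weights) are trained; classification logits would then be cosine similarities between the (adapted) image feature and each encoded prompt, as in CLIP.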
