VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang
–arXiv.org Artificial Intelligence
Deep learning has achieved extraordinary performance across a wide range of vision tasks, such as image classification [1], object detection [2], and so on [3]. Although these methods excel in specific scenarios, they lack general visual representations and are hard to transfer to open-set applications. Considering the wide coverage of language, CLIP [4] proposes to pre-train visual representations contrastively with natural-language signals.

Inspired by prompt tuning in natural language processing [5], Context Optimization (CoOp) [6] freezes CLIP's pre-trained weights and adopts learnable prompts to learn best-fitted textual inputs in place of the original hand-crafted ones. From another perspective, CLIP-Adapter [?] appends a lightweight adapter module [7] over the image or text encoders and is likewise fine-tuned with CLIP's weights frozen. However, they are constrained by the following…
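To make the two transfer styles above concrete, below is a minimal PyTorch sketch of both ideas, assuming a 512-dimensional CLIP feature space; the context length, reduction factor, and residual ratio are illustrative choices, not settings taken from CoOp or CLIP-Adapter.

import torch
import torch.nn as nn

# CoOp-style learnable prompt context (sketch): CLIP stays frozen and
# only these context vectors are trained, replacing hand-crafted words
# such as "a photo of a". Sizes here are illustrative assumptions.
n_ctx, embed_dim = 16, 512
ctx = nn.Parameter(torch.empty(n_ctx, embed_dim))
nn.init.normal_(ctx, std=0.02)
class_embed = torch.randn(1, embed_dim)        # stand-in for a class-name embedding
prompt = torch.cat([ctx, class_embed], dim=0)  # would feed CLIP's frozen text encoder


class Adapter(nn.Module):
    """CLIP-Adapter-style bottleneck MLP: frozen CLIP features are
    down- and up-projected, then residually blended with the input.
    The reduction factor and residual ratio are assumptions for
    illustration, not values taken from the paper."""

    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps most of the frozen representation intact.
        return self.ratio * self.fc(x) + (1 - self.ratio) * x


image_features = torch.randn(8, 512)  # stand-in for frozen CLIP image features
adapted = Adapter(dim=512)(image_features)
print(prompt.shape, adapted.shape)    # torch.Size([17, 512]) torch.Size([8, 512])

In both cases only a small number of new parameters is trained while the pre-trained CLIP encoders stay fixed, which is what makes these methods practical in few-shot settings.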
Aug-10-2023
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.34)
- Natural Language (1.00)
- Vision > Image Understanding (0.36)