Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
Fu, Shuai, Wang, Xiequn, Huang, Qiushi, Zhang, Yu
–arXiv.org Artificial Intelligence
With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms on the performance of VLMs. This motivates us to pose an unexplored research question: "Do we need to normalize the soft prompts in VLMs?" To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect, through extensive corruption experiments: reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method, Normalizing the soft-prompt vectors of vision-language models (Nemesis), which normalizes soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of the norms of soft-prompt vectors in VLMs, offering valuable insights for future research in soft-prompt tuning. The code is available at https://github.com/ShyFoo/Nemesis.
In the age of large-scale pretrained vision-language models (VLMs), such as CLIP (Radford et al., 2021), Flamingo (Alayrac et al., 2022), and BLIP (Li et al., 2022), soft-prompt-based methods, also known as prompt-tuning, have emerged as a dominant approach for adapting these models to a wide range of downstream tasks. For instance, Zhou et al. (2022b) propose a Context Optimization (CoOp) method to learn soft prompts in a continuous space of CLIP for image classification tasks. Additionally, Rao et al. (2022) and Du et al. (2022) employ prompt-tuning to address dense prediction and open-vocabulary object detection tasks, respectively. Recent research in the field of VLMs has been primarily focused on enhancing model performance through the alignment of visual and textual features.
For instance, Lu et al. (2022) estimate the weight distribution of output embeddings, while Zang et al. (2022) propose a joint optimization approach for prompts across multiple modalities.
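The corruption experiments behind the Low-Norm Effect involve changing the norms of learned prompt vectors and observing the effect on performance. The abstract does not spell out the exact procedure, so the following is only a minimal sketch under assumptions: the soft prompt is taken to be a matrix with one vector per row, and the hypothetical helper `rescale_prompt_norms` scales selected rows by a constant factor, which changes their L2 norm but not their direction. It is not the authors' implementation.

```python
import numpy as np


def rescale_prompt_norms(prompts, positions, factor):
    """Return a copy of `prompts` with the rows at `positions` scaled by `factor`.

    prompts:   array of shape (num_tokens, dim), one soft-prompt vector per row
               (an assumed layout, not taken from the paper).
    positions: indices of the prompt vectors to corrupt.
    factor:    values below 1 shrink the L2 norm, values above 1 enlarge it.
    """
    corrupted = np.array(prompts, dtype=float, copy=True)
    corrupted[list(positions)] *= factor
    return corrupted


# Illustrative random "learned" prompts: 4 tokens of dimension 8.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 8))

# Halve the norm of the prompt vector at position 1 only.
shrunk = rescale_prompt_norms(prompts, positions=[1], factor=0.5)

norms_before = np.linalg.norm(prompts, axis=1)
norms_after = np.linalg.norm(shrunk, axis=1)
```

In an experiment of this kind, the corrupted prompts would be fed back into the frozen VLM and the accuracy compared against the uncorrupted baseline, once per position and scaling factor.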
Aug-25-2024
- Country:
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language (1.00)
- Machine Learning > Neural Networks (0.46)
- Vision > Image Understanding (0.34)
- Representation & Reasoning > Optimization (0.34)