FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu

arXiv.org Artificial Intelligence 

Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repeated online teacher inference sacrifices the inherent training-efficiency advantage of prompt learning. To address this, we propose FDBPL, which introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions containing multi-level information. We further propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL retains both parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving 2.2…

Introduction

With the rapid development of deep learning technology, traditional models that rely only on visual features are no longer sufficient to meet the diverse task requirements of open-world scenarios. VLMs, which jointly learn from images and text, have emerged to address this limitation. Specifically, CLIP employs a dual-tower architecture comprising an image encoder and a text encoder; it combines natural language processing with computer vision and has undergone extensive self-supervised pre-training on a vast collection of image-text pairs.
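The dual-tower zero-shot scoring described above can be sketched as follows. The embeddings and dimensions are toy stand-ins for real encoder outputs, not CLIP's actual weights; the scoring logic (L2 normalization, cosine similarity, temperature-scaled softmax) matches how CLIP compares an image against a set of class-prompt embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere before comparing them."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Classify one image by cosine similarity to each class-prompt embedding."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_embs).T
    logits = sims / temperature          # CLIP scales similarities by a temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_emb = np.array([0.9, 0.1, 0.0])        # image-tower output
text_embs = np.array([[1.0, 0.0, 0.0],       # e.g. "a photo of a cat"
                      [0.0, 1.0, 0.0]])      # e.g. "a photo of a dog"
probs = zero_shot_probs(image_emb, text_embs)
```

Because no labeled examples are needed at inference time, adding a class only requires encoding a new text prompt, which is the source of CLIP's zero-shot flexibility.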
VLMs like CLIP demonstrate remarkable zero-shot abilities but require adaptation for optimal performance on downstream tasks.
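One such adaptation is the soft-prompt approach discussed above: hand-written prompt templates are replaced by learnable context vectors prepended to each class name's token embeddings, and only those vectors are trained while both encoders stay frozen. A minimal shape-level sketch, with all embeddings as hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_ctx = 8, 4

# Learnable context vectors, shared across all classes; in soft-prompt
# training these would be the only trainable parameters (sketch only).
context = rng.normal(size=(n_ctx, embed_dim))

# Frozen token embeddings for each class name (hypothetical stand-ins).
class_tokens = {
    "cat": rng.normal(size=(1, embed_dim)),
    "dog": rng.normal(size=(1, embed_dim)),
}

def build_prompt(name):
    """Prepend the shared soft context to a class's token embeddings."""
    return np.concatenate([context, class_tokens[name]], axis=0)

prompt = build_prompt("cat")  # shape: (n_ctx + 1, embed_dim)
```

Sharing one context across classes keeps the trainable parameter count tiny relative to the frozen encoders, which is the parameter-efficiency property the abstract refers to.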