OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression
This paper presents a language-powered paradigm for ordinal regression. Existing methods usually treat each rank as a category and employ a set of weights to learn these concepts. Such methods are prone to overfitting and usually attain unsatisfactory performance, as the learned concepts are derived mainly from the training set. Recent large pre-trained vision-language models like CLIP have shown impressive performance on various visual tasks. In this paper, we propose to learn the rank concepts from the rich semantic CLIP latent space. Specifically, we reformulate this task as an image-language matching problem with a contrastive objective, which regards labels as text and obtains a language prototype from a text encoder for each rank. Since prompt engineering for CLIP is extremely time-consuming, we propose OrdinalCLIP, a differentiable prompting method for adapting CLIP to ordinal regression.
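To make the matching formulation concrete, the sketch below scores an image embedding against one language prototype per rank and trains with a cross-entropy objective over ranks. It is a minimal illustration, not the authors' implementation: the linear `text_encoder` stands in for CLIP's frozen text encoder, and all names, dimensions, and the temperature value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankPrototypes(nn.Module):
    """One learnable prompt embedding per rank; a text encoder turns
    prompts into language prototypes in the shared latent space.
    (Illustrative stand-in: a linear map replaces CLIP's text encoder.)"""
    def __init__(self, num_ranks: int, prompt_dim: int, embed_dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_ranks, prompt_dim))  # learnable rank prompts
        self.text_encoder = nn.Linear(prompt_dim, embed_dim)             # placeholder for CLIP text encoder

    def forward(self) -> torch.Tensor:
        # (R, d) unit-norm language prototypes, one per rank
        return F.normalize(self.text_encoder(self.prompts), dim=-1)

def rank_logits(img_feat: torch.Tensor, prototypes: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Cosine similarity between image features and rank prototypes."""
    img_feat = F.normalize(img_feat, dim=-1)        # (B, d)
    return img_feat @ prototypes.t() / temperature  # (B, R)

# Usage: the ground-truth rank acts as the positive class in the
# contrastive objective over rank prototypes.
protos = RankPrototypes(num_ranks=8, prompt_dim=512, embed_dim=512)
img_feat = torch.randn(4, 512)                      # stand-in for CLIP image features
logits = rank_logits(img_feat, protos())            # (4, 8)
loss = F.cross_entropy(logits, torch.tensor([0, 3, 5, 7]))
```

Because each rank is represented by a prompt-derived prototype rather than a bare classifier weight, the rank concepts live in the CLIP latent space and can inherit its semantics.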
A Experimental Settings
All experiments were conducted on a single NVIDIA RTX 3090 GPU. The obtained text features were also projected into the CLIP latent space via an FC layer. Test images followed the same preprocessing, except that center cropping was used. In addition, classification accuracy is adopted for Adience.
Image Aesthetics Assessment. An ImageNet pre-trained VGG-16 was used as the image encoder.
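As a rough illustration of the setup above, the snippet below wires together an ImageNet pre-trained VGG-16 backbone, an FC layer projecting text features into the CLIP latent space, and a center-crop test-time transform. The input resolution, feature dimensions, and normalization statistics are assumptions for the sketch, not values reported in the paper.

```python
import torch.nn as nn
from torchvision import transforms
from torchvision.models import vgg16

# ImageNet pre-trained VGG-16 as the image encoder (as stated above).
image_encoder = vgg16(weights="IMAGENET1K_V1")

# FC layer projecting text features into the CLIP latent space
# (512 -> 512 is an assumed dimensionality).
text_proj = nn.Linear(512, 512)

# Test-time preprocessing: same pipeline as training, but with a
# deterministic center crop (resolution and statistics are assumed).
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```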