Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training
arXiv.org Artificial Intelligence
Abstract: Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. The challenge is that inter-class differences are subtle, while intra-class variation (pose, background, lighting) can be large. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds [1], our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
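To make the alignment step concrete, here is a minimal sketch of one way three modality embeddings could be pulled into a shared space. The abstract does not specify the loss, so everything here is an assumption: a symmetric InfoNCE (CLIP-style) loss applied to each modality pair and summed, a temperature of 0.07, and the hypothetical function names `symmetric_info_nce` and `cross_contrastive_loss`. It is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair; every
    other row in the batch serves as a negative. (Assumed loss form.)
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def cross_contrastive_loss(image_emb, text_emb, meta_emb, temperature=0.07):
    """Sum of pairwise losses over the three modality pairs (an assumption)."""
    return (
        symmetric_info_nce(image_emb, text_emb, temperature)
        + symmetric_info_nce(image_emb, meta_emb, temperature)
        + symmetric_info_nce(text_emb, meta_emb, temperature)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs: 8 samples, 16-dimensional embeddings.
    image_emb = rng.normal(size=(8, 16))
    text_emb = image_emb + 0.1 * rng.normal(size=(8, 16))
    meta_emb = image_emb + 0.1 * rng.normal(size=(8, 16))
    print(cross_contrastive_loss(image_emb, text_emb, meta_emb))
```

In a real training loop the three inputs would come from the image, text, and metadata encoders, and the loss would be minimized with an autodiff framework. After this stage, the abstract's second step keeps only the image and metadata encoders for classification fine-tuning.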
Oct-21-2025