Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training

Oct-21-2025, arXiv.org Artificial Intelligence

Abstract: Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. The challenge is that inter-class differences are subtle while intra-class variation (pose, background, lighting) can be large. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds [1], our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
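The abstract does not spell out the pre-training objective, so the following is only a minimal sketch of one plausible reading of "cross-contrastive" alignment: a symmetric InfoNCE loss summed over every pair of modalities. The function names (`info_nce`, `cross_contrastive_loss`), the temperature of 0.07, and the equal weighting of pairs are all assumptions made for illustration, not details taken from the paper.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match targets[i].

    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every target in the batch.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def cross_contrastive_loss(embeddings, temperature=0.07):
    """Sum a symmetric InfoNCE term over every pair of modalities.

    `embeddings` maps a modality name to a batch of vectors; row i of each
    batch is assumed to describe the same sample.
    """
    loss = 0.0
    for name_a, name_b in combinations(sorted(embeddings), 2):
        a, b = embeddings[name_a], embeddings[name_b]
        loss += 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))
    return loss


# Toy batch of two samples with 2-D embeddings per modality.
aligned = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "metadata": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch, but the text rows are swapped so they no longer match.
misaligned = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])
```

With three modalities this sums three pairwise terms (image-text, image-metadata, text-metadata). On the toy data, `cross_contrastive_loss(aligned)` is smaller than `cross_contrastive_loss(misaligned)`, which is the behavior the alignment stage relies on. In the fine-tuning stage described above, only the image and metadata encoders would then be carried forward into the classifier.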

artificial intelligence, machine learning, metadata

Genre:
- Research Report (0.66)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Machine Learning > Neural Networks (0.69)