Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

Tang, Wei; Wang, Zuo-Zheng; Zhang, Kun; Wei, Tong; Zhang, Min-Ling

arXiv.org Artificial Intelligence 

Abstract: Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. The proposed framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It also improves generalization through test-time ensembling and realigns the visual and textual modalities using parameter-efficient fine-tuning, which averts overfitting on tail classes without compromising head class performance.

Recent progress in long-tailed visual recognition has mainly centered on the single-label multi-class setting. However, real-world images often encompass multiple objects and concepts, giving rise to long-tailed multi-label visual recognition (LTML), where each image is associated with multiple labels that follow a long-tailed distribution [5], [6], [7], [8], [9], [10]. LTML introduces additional intricacies compared to its single-label multi-class counterpart, including label co-occurrences, which create interdependencies among classes, and intra-class imbalances between positive and negative instances. Existing LTML methods have primarily relied on strategies like class re-sampling, loss re-weighting, and specialized architectures, typically built upon ImageNet-pre-trained convolutional neural networks (CNNs) as backbones [5], [6], [7].
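
The two LTML intricacies named above, label co-occurrence and the per-class imbalance between positive and negative instances, can be made concrete with a few counts over a multi-label dataset. The following is a minimal sketch on hypothetical toy data, not the paper's method: the function name `label_statistics`, the toy `label_sets`, and the head-to-tail `imbalance_ratio` are all assumptions introduced here for illustration.

```python
from itertools import combinations


def label_statistics(label_sets, num_classes):
    """Count positives, negatives, and pairwise co-occurrences per class.

    label_sets: one list of integer class indices per image (hypothetical format).
    """
    num_images = len(label_sets)
    pos = [0] * num_classes   # images in which each class is present
    cooc = {}                 # (class_a, class_b) -> images containing both
    for labels in label_sets:
        unique = sorted(set(labels))
        for c in unique:
            pos[c] += 1
        for a, b in combinations(unique, 2):
            cooc[(a, b)] = cooc.get((a, b), 0) + 1
    # Every image that lacks a class is a negative instance for that class.
    neg = [num_images - p for p in pos]
    return pos, neg, cooc


# Toy dataset (assumed): class 0 is a head class, classes 2 and 3 are tail classes.
label_sets = [
    [0, 1], [0], [0, 1], [0, 2], [0], [1], [0, 1, 3], [0],
]

pos, neg, cooc = label_statistics(label_sets, num_classes=4)

# Ratio between the most and least frequent class, guarding against empty classes.
imbalance_ratio = max(pos) / max(1, min(pos))

print("positives per class:", pos)
print("negatives per class:", neg)
print("co-occurrence counts:", cooc)
print("head-to-tail imbalance ratio:", imbalance_ratio)
```

In this toy data each tail class has one positive image against seven negatives, and each of its co-occurrence counts rests on a single image. This is the situation in which correlations estimated directly from the training set become unreliable.
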
Wei Tang is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, the Key Laboratory of Computer Network and Information Integration (Southeast University), MoE, China, and with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE (e-mail: tangw@seu.edu.cn).
