HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification

Ouyang, Shuyi, Wang, Hongyi, Niu, Ziwei, Bai, Zhenjia, Xie, Shiao, Xu, Yingying, Tong, Ruofeng, Chen, Yen-Wei, Lin, Lanfen

Jul-23-2024–arXiv.org Artificial Intelligence

Multi-label image classification involves recognizing multiple objects within a single image. Because labels carry valuable semantic information and images carry essential visual features, tight visual-linguistic interaction plays a vital role in improving classification performance. Moreover, since object size and appearance can vary within a single image, attending to features at different scales helps to discover the objects present. Recently, Transformer-based methods have achieved great success in multi-label image classification by modeling long-range dependencies, but they have several limitations. First, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Additionally, they extract visual features and perform cross-modal fusion at only a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) a hierarchical multi-scale architecture with a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances; and (2) Interactive Visual-Linguistic Attention, a novel attention module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic and multi-modal features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.
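To make the two ideas in the abstract concrete, the following is a minimal sketch, not the HSVLT implementation. It assumes label embeddings act as queries over visual tokens at each scale (plain scaled dot-product attention with no learned projections), and it substitutes a simple average over scales for the Cross-Scale Aggregation module. All function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, tokens):
    """Each query attends over tokens (scaled dot-product, no learned weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(dim)])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multi_label_scores(label_embs, visual_scales):
    """Assumed stand-in for cross-scale fusion: average the label-conditioned
    features over all scales, then score every label independently."""
    per_scale = [cross_attend(label_embs, tokens) for tokens in visual_scales]
    scores = []
    for j, label in enumerate(label_embs):
        fused = [sum(s[j][i] for s in per_scale) / len(per_scale)
                 for i in range(len(label))]
        scores.append(sigmoid(dot(label, fused)))
    return scores

# Toy inputs (hypothetical): two labels, visual tokens at a fine and a coarse scale.
labels = [[1.0, 0.0], [0.0, 1.0]]
fine = [[2.0, -1.0], [1.5, -0.5], [1.8, -1.2], [2.1, -0.8]]
coarse = [[2.0, -1.0]]

scores = multi_label_scores(labels, [fine, coarse])
predicted = [s > 0.5 for s in scores]  # independent threshold per label
```

Each label gets its own sigmoid score rather than competing in a softmax, which is what lets several labels be predicted for one image. The method described in the abstract additionally updates visual, linguistic and multi-modal features jointly inside the backbone, which this sketch does not attempt.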

classification, hsvlt, proceedings

Country:
- North America
  - United States > New York
    - New York County > New York City (0.04)
  - Canada > Ontario
    - National Capital Region > Ottawa (0.05)
- Europe
  - Switzerland > Zürich
    - Zürich (0.14)
  - Netherlands > North Holland
    - Amsterdam (0.04)
- Asia
  - Singapore (0.04)
  - Japan (0.04)
  - China > Zhejiang Province
    - Hangzhou (0.05)

Genre:
- Research Report
  - Promising Solution (0.34)
  - New Finding (0.34)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision > Image Understanding (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.49)
