HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention

Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang

arXiv.org Artificial Intelligence 

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches of CLIP with hierarchy-aware attention, yielding Hierarchy-aware CLIP (HiCLIP), which progressively discovers semantic hierarchies layer by layer from both images and texts in an unsupervised manner. This hierarchical aggregation significantly improves cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.

In recent years, vision-language pretraining has achieved significant progress when paired with large-scale multimodal data. Contrastive vision-language pretraining (CLIP) is notable for its generalization ability on zero-shot tasks and its robustness to domain shift (Radford et al., 2021). Moreover, the spectrum of problems that CLIP can solve ranges from visual recognition and image-text retrieval to vision-language reasoning tasks, given appropriate prompt engineering (Zhou et al., 2022; Gao et al., 2021; Xu et al., 2021; Shridhar et al., 2021; Rao et al., 2022; Zhong et al., 2022).
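The abstract does not spell out the hierarchy-aware attention mechanism. The sketch below illustrates one plausible reading, assuming a Tree-Transformer-style constituent prior: adjacent tokens (or image patches) receive a learned merge probability, merge probabilities can only grow across layers (so constituents expand progressively), and standard attention weights are rescaled by the resulting span prior. All function and parameter names here (neighbor_merge_probs, constituent_prior, w_q, w_k) are hypothetical, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def neighbor_merge_probs(x, w_q, w_k, prev_a=None):
    """Probability that adjacent tokens i and i+1 belong to the same
    constituent (hypothetical Tree-Transformer-style sigmoid gate)."""
    q = x @ w_q                                        # (B, T, d)
    k = x @ w_k
    # similarity between each token and its right neighbor
    s = (q[:, :-1] * k[:, 1:]).sum(-1) / q.shape[-1] ** 0.5  # (B, T-1)
    a = torch.sigmoid(s)
    if prev_a is not None:
        # constituents may only grow from layer to layer:
        # a_l = a_{l-1} + (1 - a_{l-1}) * a_hat_l
        a = prev_a + (1.0 - prev_a) * a
    return a

def constituent_prior(a):
    """C[i, j] = product of merge probabilities along the span (i, j);
    distant tokens get a small prior unless every link between them
    is strong."""
    B, Tm1 = a.shape
    T = Tm1 + 1
    # log-space cumulative sums give all span products in O(T^2)
    log_a = torch.log(a.clamp_min(1e-6))
    cum = F.pad(log_a.cumsum(-1), (1, 0))              # (B, T)
    i = torch.arange(T, device=a.device)
    lo = torch.minimum(i[:, None], i[None, :])
    hi = torch.maximum(i[:, None], i[None, :])
    return torch.exp(cum[:, hi] - cum[:, lo])          # (B, T, T)

def hierarchy_aware_attention(q, k, v, C):
    """Standard scaled dot-product attention, reweighted by the
    constituent prior C and renormalized."""
    att = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, -1)
    att = att * C
    att = att / att.sum(-1, keepdim=True).clamp_min(1e-6)
    return att @ v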
