HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Liu, Song; Fan, Haoqi; Qian, Shengsheng; Chen, Yiru; Ding, Wenkui; Wang, Zhongyuan

Mar-28-2021–arXiv.org Artificial Intelligence

Video-text retrieval has become an active research topic with the explosion of multimedia data on the Internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) limited exploitation of the transformer architecture, in which different layers have different feature characteristics; and 2) an end-to-end training mechanism that restricts negative interactions to the samples in a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level to achieve multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on the fly, which contributes to more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our methods.
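The MoCo-style mechanism mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`momentum_update`, `info_nce`), the momentum and temperature values, the queue size, and the 2-dimensional toy vectors are all assumptions chosen for illustration. The sketch shows the two generic ingredients, a slowly updated key encoder and a FIFO queue of past keys that supplies negatives beyond the current mini-batch.

```python
import math
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy over [positive, negatives...] with the positive at index 0."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


# Toy usage (hypothetical data): a video query, its paired text key,
# and a queue of text keys kept from earlier steps.
text_queue = deque(maxlen=4)  # FIFO memory; the oldest key drops out when full
for old_key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_queue.append(old_key)

video_query = [1.0, 0.1]
loss_matched = info_nce(video_query, [1.0, 0.0], list(text_queue))
loss_mismatched = info_nce(video_query, [-1.0, 0.0], list(text_queue))

# After each step, nudge the key encoder toward the query encoder
# and enqueue the newest key.
key_params = momentum_update([1.0, 1.0], [0.0, 0.0], m=0.9)
text_queue.append([1.0, 0.0])
```

In this toy setup the loss is lower when the text key points in the same direction as the video query than when it points away from it. The symmetric text-to-video direction would use a second queue of video keys in the same way.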

encoder, representation, retrieval

Country:
- North America > United States
  - Washington > King County
    - Seattle (0.04)
  - Massachusetts > Suffolk County
    - Boston (0.04)
  - Louisiana > Orleans Parish
    - New Orleans (0.04)
  - Hawaii > Honolulu County
    - Honolulu (0.04)
- Europe
  - United Kingdom > Wales
    - Cardiff (0.04)
  - Italy > Veneto
    - Venice (0.04)
  - Germany > Bavaria
    - Upper Bavaria > Munich (0.04)
- Asia
  - South Korea > Seoul
    - Seoul (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)
  - Japan > Honshū
    - Kantō > Kanagawa Prefecture > Yokohama (0.04)

Genre:
- Research Report > Promising Solution (0.34)

Industry:
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.89)
