CLIP2TV: Align, Match and Distill for Video-Text Retrieval

Gao, Zijian, Liu, Jingyu, Sun, Weiqi, Chen, Sheng, Chang, Dedan, Zhao, Lili

Jul-21-2022–arXiv.org Artificial Intelligence

Modern video-text retrieval frameworks typically consist of three parts: a video encoder, a text encoder, and a similarity head. With the success of both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in the field of video-text retrieval. In this paper, we propose a new CLIP-based framework called CLIP2TV, which consists of a video-text alignment module and a video-text matching module. The two modules are trained end-to-end in a coordinated manner and boost each other's performance. Moreover, to address the impairment caused by data noise, especially false negatives introduced by vague descriptions in some datasets, we propose similarity distillation to alleviate the problem. Extensive experimental results on various datasets validate the effectiveness of the proposed methods. Finally, on common datasets with video clips of various lengths, CLIP2TV achieves better or competitive results compared with previous SOTA methods.
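The abstract does not give the loss formulation, so the following is only a minimal sketch of the general ideas it names: a similarity head over video and text embeddings, a contrastive alignment objective, and a distillation term that replaces hard one-hot targets with soft ones. All function names, the temperature value, the use of cosine similarity, the symmetric cross-entropy, the KL form of the distillation term, and the choice of which similarity matrix acts as teacher are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity_matrix(video_embs, text_embs):
    """Hypothetical similarity head: cosine similarity of every video-text pair."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) for t in texts] for v in videos]


def contrastive_loss(sim, temperature=0.05):
    """Symmetric cross-entropy; pair i is assumed to be the only positive for row i."""
    n = len(sim)
    video_to_text = -sum(math.log(softmax(sim[i], temperature)[i]) for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_video = -sum(math.log(softmax(columns[j], temperature)[j]) for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


def distillation_loss(student_sim, teacher_sim, temperature=0.05):
    """Row-wise KL(teacher || student): soft targets instead of one-hot labels.

    Assumption: soft targets let a plausible but unlabeled pair (a possible
    false negative) keep some probability mass instead of being pushed to zero.
    """
    n = len(student_sim)
    total = 0.0
    for i in range(n):
        p = softmax(teacher_sim[i], temperature)
        q = softmax(student_sim[i], temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / n


# Toy batch of two video-text pairs with 2-dimensional embeddings.
video_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[2.0, 0.0], [0.0, 3.0]]
sim = cosine_similarity_matrix(video_embs, text_embs)
alignment = contrastive_loss(sim)
distill = distillation_loss(sim, sim)
```

In this toy batch the matched pairs are perfectly aligned, so the contrastive term is close to zero, and distilling a similarity matrix into itself gives a zero KL term. In a real setup the teacher and student matrices would come from different parts of the model.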





Country:
- North America > United States > North Carolina (0.04)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (0.49)
  - Representation & Reasoning > Information Fusion (0.34)
  - Machine Learning
    - Neural Networks (0.48)
    - Performance Analysis > Accuracy (0.34)
