CLIP2TV: Align, Match and Distill for Video-Text Retrieval
Gao, Zijian, Liu, Jingyu, Sun, Weiqi, Chen, Sheng, Chang, Dedan, Zhao, Lili
arXiv.org Artificial Intelligence
Modern video-text retrieval frameworks consist of three parts: a video encoder, a text encoder, and a similarity head. With the success of both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in the field of video-text retrieval. In this paper, we propose a new CLIP-based framework called CLIP2TV, which consists of a video-text alignment module and a video-text matching module. The two modules are trained end-to-end in a coordinated manner and boost each other's performance. Moreover, to address the impairment caused by data noise, especially false negatives introduced by vague descriptions in some datasets, we propose similarity distillation. Extensive experimental results on various datasets validate the effectiveness of the proposed methods. Finally, on common datasets with video clips of various lengths, CLIP2TV achieves better or competitive results compared with previous SOTA methods.
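The abstract does not spell out the loss functions, so the following is only a minimal sketch of the general ideas it names, not the authors' implementation. It assumes that alignment is trained with a symmetric contrastive (InfoNCE-style) loss over a video-text cosine similarity matrix, and that similarity distillation replaces one-hot targets with soft targets taken from a second ("teacher") similarity matrix, so that plausible false negatives are penalized less. All function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity_matrix(video_embs, text_embs):
    """Return S where S[i][j] is the cosine similarity of video i and text j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) for t in texts] for v in videos]


def softmax(row, temperature):
    """Numerically stable softmax of a row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def contrastive_loss(sim, temperature=0.05):
    """Symmetric InfoNCE loss; matched pairs are assumed to lie on the diagonal."""
    def one_direction(matrix):
        log_probs = [math.log(softmax(row, temperature)[i])
                     for i, row in enumerate(matrix)]
        return -sum(log_probs) / len(matrix)

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (one_direction(sim) + one_direction(transpose(sim)))


def distillation_loss(student_sim, teacher_sim, temperature=0.05):
    """Mean row-wise KL(teacher || student) over softmax-normalized similarities."""
    total = 0.0
    for s_row, t_row in zip(student_sim, teacher_sim):
        p = softmax(t_row, temperature)  # soft targets (assumed teacher)
        q = softmax(s_row, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_sim)


# Toy batch: two videos and two captions with 2-d embeddings (made-up values).
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8]]
sim = cosine_similarity_matrix(videos, texts)
aligned = contrastive_loss(sim)
misaligned = contrastive_loss([sim[1], sim[0]])  # rows swapped: pairs no longer match
self_kd = distillation_loss(sim, sim)
```

In this toy batch the correctly paired similarity matrix yields a lower contrastive loss than the row-swapped one, and distilling a matrix against itself gives zero. In a real system the embeddings would come from the video and text encoders, and the teacher similarities from whichever module supplies the soft targets.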
Jul-21-2022