Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

arXiv.org, Artificial Intelligence

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism extending the CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized.

However, it is hard to obtain a pretrained model as powerful as CLIP in the video domain, due to the unaffordable demands on computation resources and the difficulty of collecting video-text data pairs as large and diverse as image-text data. Instead of directly pursuing video-text pretrained models [17, 27], a potential alternative that benefits downstream video tasks is to transfer the knowledge in image-text pretrained models to the video domain, which has attracted increasing attention in recent years [12, 13, 26, 29, 30, 41]. Extending pretrained 2D image models to the video domain is a widely studied topic in deep learning [4, 7], and the key point lies in empowering 2D models with the capability of modeling temporal dependency between video frames while taking advantage of the knowledge in the pretrained models. In this paper, based on CLIP [32], we revisit temporal modeling in the context of image-to-video knowledge transferring, and present the Spatial-Temporal Auxiliary Network (STAN), a new temporal modeling method that is simple and effective for extending image-text pretrained models to diverse downstream video tasks.
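The abstract describes STAN's design only at a high level. For concreteness, below is a minimal PyTorch sketch of what such an auxiliary branch could look like, assuming self-attention for both the spatial and temporal modules, simple additive fusion of multi-level CLIP features, and mean pooling at the end. All class names, argument names, and fusion/pooling choices here are hypothetical illustrations, not the paper's exact implementation.

```python
# Sketch of a STAN-style auxiliary branch (hypothetical names and shapes;
# the paper's actual module design, e.g. weight initialization from CLIP,
# may differ). Multi-level patch features from a frozen CLIP image encoder
# are fused into a side branch and contextualized spatial-temporally.
import torch
import torch.nn as nn


class DecomposedSTBlock(nn.Module):
    """One auxiliary block: spatial attention within each frame,
    then temporal attention across frames at each spatial position."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, patch tokens, channels
        B, T, N, D = x.shape
        # Spatial attention: patch tokens attend within each frame.
        s = x.reshape(B * T, N, D)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q)[0]
        s = s.reshape(B, T, N, D)
        # Temporal attention: each spatial position attends across frames.
        t = s.permute(0, 2, 1, 3).reshape(B * N, T, D)
        q = self.norm2(t)
        t = t + self.temporal_attn(q, q, q)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)


class STANBranch(nn.Module):
    """Auxiliary branch alongside CLIP: each block consumes the running
    branch state plus one intermediate CLIP feature map, so both low-level
    and high-level CLIP features are spatial-temporally contextualized."""

    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            DecomposedSTBlock(dim) for _ in range(num_blocks)
        )

    def forward(self, clip_feats: list[torch.Tensor]) -> torch.Tensor:
        # clip_feats: per-level patch features tapped from CLIP layers,
        # each of shape (B, T, N, D); expects num_blocks + 1 levels.
        x = clip_feats[0]
        for block, feat in zip(self.blocks, clip_feats[1:]):
            x = block(x + feat)  # fuse the next CLIP level, then contextualize
        return x.mean(dim=(1, 2))  # pooled video-level representation


if __name__ == "__main__":
    B, T, N, D = 2, 8, 49, 512  # e.g., 8 frames of 7x7 CLIP patch tokens
    levels = [torch.randn(B, T, N, D) for _ in range(4)]  # 4 tapped CLIP layers
    branch = STANBranch(dim=D, num_blocks=3)
    video_emb = branch(levels)  # (B, D) video embedding
```

One motivation for decomposing attention this way: joint attention over all frame-patch tokens costs O((T·N)^2) per block, whereas the decomposed form costs O(T·N^2) for the spatial step plus O(N·T^2) for the temporal step, which is far cheaper for typical frame counts while still propagating information across both axes.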
