VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

arXiv.org Artificial Intelligence 

Given a well-pretrained image-text foundation model, it is natural to question whether any heavy video-specific adaptor or much video-specific data is needed when transferring to video-text modelling. In this paper, we explore an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, a minimalist approach that extends the image-text contrastive captioners (CoCa) [68] to video-text tasks. The design principle of VideoCoCa is to maximally reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
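The following is a minimal sketch (not the authors' released code) of the idea the abstract describes: per-frame token embeddings from an image encoder are flattened along the time axis into one long sequence, which is then consumed directly by learned attentional poolers analogous to CoCa's contrastive pooler (a single query producing a video-level embedding) and generative pooler (multiple queries producing tokens for the text decoder). All module names, dimensions, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions.

import torch
import torch.nn as nn


class AttentionalPooler(nn.Module):
    """Cross-attention from a fixed set of learned queries to input tokens."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, num_queries, dim)


def flatten_frame_embeddings(frame_tokens: torch.Tensor) -> torch.Tensor:
    # frame_tokens: (batch, num_frames, tokens_per_frame, dim)
    b, t, n, d = frame_tokens.shape
    return frame_tokens.reshape(b, t * n, d)  # one long sequence over all frames


if __name__ == "__main__":
    dim, frames, tokens_per_frame = 512, 8, 196
    # Stand-in for per-frame outputs of a frozen image encoder.
    frame_tokens = torch.randn(2, frames, tokens_per_frame, dim)

    flat = flatten_frame_embeddings(frame_tokens)              # (2, 8*196, 512)
    contrastive_pool = AttentionalPooler(dim, num_queries=1)   # video-level embedding
    generative_pool = AttentionalPooler(dim, num_queries=256)  # tokens for the text decoder

    video_embedding = contrastive_pool(flat).squeeze(1)        # (2, 512)
    decoder_tokens = generative_pool(flat)                     # (2, 256, 512)
    print(video_embedding.shape, decoder_tokens.shape)

Because the poolers attend over all frames jointly, no cross-frame fusion module is added; only the pooling layers (and optionally a lightweight finetuning stage) need to adapt to video input.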
