VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

arXiv.org Artificial Intelligence 

Given a well-pretrained image-text foundation model, it is natural to question whether any heavy video-specific adaptor or much video-specific data is needed when transferring to video-text modelling. In this paper, we explore an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, a minimalist approach that extends the image-text contrastive captioners (CoCa) [68] to video-text tasks. The design principle of VideoCoCa is to maximally reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
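The following is a minimal sketch (not the authors' released code) of the idea the abstract describes: per-frame token embeddings from an image encoder are flattened along the time axis into one long sequence, which is then consumed directly by learned attentional poolers analogous to CoCa's contrastive pooler (a single query producing a video-level embedding) and generative pooler (multiple queries producing tokens for the text decoder). All module names, dimensions, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions.

import torch
import torch.nn as nn


class AttentionalPooler(nn.Module):
    """Cross-attention from a fixed set of learned queries to input tokens."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, num_queries, dim)


def flatten_frame_embeddings(frame_tokens: torch.Tensor) -> torch.Tensor:
    # frame_tokens: (batch, num_frames, tokens_per_frame, dim)
    b, t, n, d = frame_tokens.shape
    return frame_tokens.reshape(b, t * n, d)  # one long sequence over all frames


if __name__ == "__main__":
    dim, frames, tokens_per_frame = 512, 8, 196
    # Stand-in for per-frame outputs of a frozen image encoder.
    frame_tokens = torch.randn(2, frames, tokens_per_frame, dim)

    flat = flatten_frame_embeddings(frame_tokens)              # (2, 8*196, 512)
    contrastive_pool = AttentionalPooler(dim, num_queries=1)   # video-level embedding
    generative_pool = AttentionalPooler(dim, num_queries=256)  # tokens for the text decoder

    video_embedding = contrastive_pool(flat).squeeze(1)        # (2, 512)
    decoder_tokens = generative_pool(flat)                     # (2, 256, 512)
    print(video_embedding.shape, decoder_tokens.shape)

Because the poolers attend over all frames jointly, no cross-frame fusion module is added; only the pooling layers (and optionally a lightweight finetuning stage) need to adapt to video input.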
