VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
–arXiv.org Artificial Intelligence
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification...

Deep learning can benefit a lot from labeled data [23], but this is hard to acquire at scale. Consequently there has been a lot of recent interest in "self-supervised learning", where we train a model on various "proxy tasks", which we hope will result in the discovery of features or representations that can be used in downstream tasks (see e.g., [22]). A wide variety of such proxy tasks have been proposed in the image and video domains. However, most of these methods focus on low-level features (e.g., textures) and short temporal scales (e.g., motion patterns that last a second or less). We are interested in discovering high-level semantic features which correspond to actions and events that unfold over longer time scales (e.g., ...)
Apr-3-2019
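As a rough illustration of the pipeline the abstract describes (discretizing video features by vector quantization and training a BERT-style cloze objective over the combined visual-linguistic token sequence), the sketch below shows one way such inputs could be assembled. This is a minimal sketch, not the authors' code: the placeholder features, vocabulary size, special tokens, and masking rate are all assumptions.

```python
# Minimal sketch (assumptions throughout): turn clip-level video features into
# discrete "visual tokens" via k-means vector quantization, join them with ASR
# text tokens into one sequence, and mask tokens for a BERT-style cloze task.
import numpy as np
from sklearn.cluster import KMeans

NUM_VISUAL_TOKENS = 128   # assumed visual vocabulary size (toy value)
MASK_PROB = 0.15          # standard BERT masking rate

# 1) Vector-quantize clip-level features (placeholder random features here;
#    in practice these would come from a pretrained video encoder).
video_features = np.random.randn(2000, 512)
quantizer = KMeans(n_clusters=NUM_VISUAL_TOKENS, n_init=4, random_state=0).fit(video_features)

def visual_tokens(clip_features):
    """Map each clip's feature vector to the id of its nearest centroid."""
    return [f"[VID_{i}]" for i in quantizer.predict(clip_features)]

# 2) Build a joint linguistic + visual sequence; "[>]" is an assumed
#    separator between the text and video parts.
def joint_sequence(asr_tokens, clip_features):
    return ["[CLS]"] + asr_tokens + ["[>]"] + visual_tokens(clip_features) + ["[SEP]"]

# 3) Randomly mask ordinary tokens; the model would be trained to predict them.
def mask_tokens(tokens, rng=np.random.default_rng(0)):
    inputs, targets = [], []
    for tok in tokens:
        if tok not in {"[CLS]", "[SEP]", "[>]"} and rng.random() < MASK_PROB:
            inputs.append("[MASK]")
            targets.append(tok)       # loss only on masked positions
        else:
            inputs.append(tok)
            targets.append("[PAD]")   # ignored by the loss
    return inputs, targets

asr = "add a little bit of olive oil".split()
clips = np.random.randn(4, 512)       # features for 4 consecutive clips
inputs, targets = mask_tokens(joint_sequence(asr, clips))
print(inputs)
print(targets)
```

The masked sequence and its targets are what a BERT-style transformer would consume; the transformer itself is omitted here, since the point is only how video is mapped into the same token space as language.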