VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
–arXiv.org Artificial Intelligence
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification...

Deep learning can benefit a lot from labeled data [23], but this is hard to acquire at scale. Consequently there has been a lot of recent interest in "self-supervised learning", where we train a model on various "proxy tasks", which we hope will result in the discovery of features or representations that can be used in downstream tasks (see e.g., [22]). A wide variety of such proxy tasks have been proposed in the image and video domains. However, most of these methods focus on low-level features (e.g., textures) and short temporal scales (e.g., motion patterns that last a second or less). We are interested in discovering high-level semantic features which correspond to actions and events that unfold over longer time scales (e.g., ...)
Apr-3-2019
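As a rough illustration of the pipeline the abstract describes (discretizing video features by vector quantization and training a BERT-style cloze objective over the combined visual-linguistic token sequence), the sketch below shows one way such inputs could be assembled. This is a minimal sketch, not the authors' code: the placeholder features, vocabulary size, special tokens, and masking rate are all assumptions.

```python
# Minimal sketch (assumptions throughout): turn clip-level video features into
# discrete "visual tokens" via k-means vector quantization, join them with ASR
# text tokens into one sequence, and mask tokens for a BERT-style cloze task.
import numpy as np
from sklearn.cluster import KMeans

NUM_VISUAL_TOKENS = 128   # assumed visual vocabulary size (toy value)
MASK_PROB = 0.15          # standard BERT masking rate

# 1) Vector-quantize clip-level features (placeholder random features here;
#    in practice these would come from a pretrained video encoder).
video_features = np.random.randn(2000, 512)
quantizer = KMeans(n_clusters=NUM_VISUAL_TOKENS, n_init=4, random_state=0).fit(video_features)

def visual_tokens(clip_features):
    """Map each clip's feature vector to the id of its nearest centroid."""
    return [f"[VID_{i}]" for i in quantizer.predict(clip_features)]

# 2) Build a joint linguistic + visual sequence; "[>]" is an assumed
#    separator between the text and video parts.
def joint_sequence(asr_tokens, clip_features):
    return ["[CLS]"] + asr_tokens + ["[>]"] + visual_tokens(clip_features) + ["[SEP]"]

# 3) Randomly mask ordinary tokens; the model would be trained to predict them.
def mask_tokens(tokens, rng=np.random.default_rng(0)):
    inputs, targets = [], []
    for tok in tokens:
        if tok not in {"[CLS]", "[SEP]", "[>]"} and rng.random() < MASK_PROB:
            inputs.append("[MASK]")
            targets.append(tok)       # loss only on masked positions
        else:
            inputs.append(tok)
            targets.append("[PAD]")   # ignored by the loss
    return inputs, targets

asr = "add a little bit of olive oil".split()
clips = np.random.randn(4, 512)       # features for 4 consecutive clips
inputs, targets = mask_tokens(joint_sequence(asr, clips))
print(inputs)
print(targets)
```

The masked sequence and its targets are what a BERT-style transformer would consume; the transformer itself is omitted here, since the point is only how video is mapped into the same token space as language.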