
Collaborating Authors: Myers, Austin


Neptune: The Long Orbit to Benchmarking Long Video Understanding

arXiv.org Artificial Intelligence

This paper describes a semi-automatic pipeline for generating challenging question-answer-decoy sets for long video understanding. Many existing video datasets and models focus on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and they are usually manually annotated at high cost. To mitigate both of these problems, we propose a scalable dataset creation pipeline that leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions as well as challenging question-answer-decoy sets for video segments up to 15 minutes in length. Our dataset, Neptune, covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open-source, model-based metric, GEM, to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting, and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
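To make the question-answer-decoy (QAD) format concrete, here is a minimal sketch of multiple-choice scoring for items of this kind. The field names ("question", "answer", "decoys", "prediction") and the record layout are illustrative assumptions, not Neptune's actual schema; see the repository linked above for the real data format and the GEM metric for open-ended scoring.

```python
# Minimal sketch: accuracy over question-answer-decoy items.
# Field names below are hypothetical, not the dataset's actual schema.

def qad_accuracy(items: list[dict]) -> float:
    """Fraction of items where the predicted option string matches the answer."""
    correct = 0
    for item in items:
        options = [item["answer"], *item["decoys"]]
        assert item["prediction"] in options, "prediction must be one of the options"
        correct += item["prediction"] == item["answer"]
    return correct / max(len(items), 1)

if __name__ == "__main__":
    demo = [{
        "question": "What does the presenter do right after unboxing the phone?",
        "answer": "Peels off the screen protector",
        "decoys": ["Reads the manual", "Plugs in the charger",
                   "Installs an app", "Compares it to an older model"],
        "prediction": "Peels off the screen protector",
    }]
    print(f"multiple-choice accuracy: {qad_accuracy(demo):.2f}")
```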


IC3: Image Captioning by Committee Consensus

arXiv.org Artificial Intelligence

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most reference-like) caption. Unfortunately, doing so encourages captions that are "informationally impoverished," focusing on only a subset of the possible details while ignoring other potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 as at least as helpful as those from baseline SOTA models more than two-thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/
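A rough sketch of the committee-consensus idea, under the assumption that it works roughly as the abstract describes: sample several candidate captions from a captioning model, then ask a language model to condense the "committee" into a single caption. The `sample_caption` and `summarize` callables stand in for whatever captioner and LLM are used; they are assumptions for illustration, not the authors' exact interface.

```python
from typing import Callable, List

def caption_by_committee(
    image: object,
    sample_caption: Callable[[object], str],
    summarize: Callable[[str], str],
    committee_size: int = 5,
) -> str:
    # 1) Sample a committee of candidate captions (e.g., via temperature sampling),
    #    so that different samples surface different details of the scene.
    committee: List[str] = [sample_caption(image) for _ in range(committee_size)]

    # 2) Prompt the summarizer to keep details the committee members agree on,
    #    without inventing new ones.
    prompt = (
        "Several annotators described the same image:\n"
        + "\n".join(f"- {c}" for c in committee)
        + "\nWrite one caption that captures the details they agree on."
    )

    # 3) Return a single consensus caption.
    return summarize(prompt)
```

The design intent sketched here is that sampling exposes complementary details while summarization filters out hallucinations that only one committee member mentions; the actual prompt and models used by IC3 are documented at the project page above.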


What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

arXiv.org Artificial Intelligence

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data that inflate evaluation metrics, but what are those patterns? In this work, we examine several popular visual description datasets and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground-truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models; we recommend several methods and approaches for maintaining diversity in the collection of new data, and for dealing with the consequences of limited diversity when using current models and metrics.
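As a small, hedged illustration of the kind of token-level diversity statistics such an analysis involves (not the paper's exact measurements), the sketch below computes vocabulary size, a distinct n-gram ratio, and the share of the single most frequent caption over a toy set of references.

```python
# Illustrative caption-diversity statistics over a list of reference captions.
from collections import Counter
from typing import Iterable, List, Tuple

def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def diversity_stats(captions: List[str], n: int = 2) -> dict:
    tokenized = [c.lower().split() for c in captions]
    all_tokens = [t for toks in tokenized for t in toks]
    all_ngrams = [g for toks in tokenized for g in ngrams(toks, n)]
    return {
        "vocab_size": len(set(all_tokens)),
        f"distinct_{n}gram_ratio": len(set(all_ngrams)) / max(len(all_ngrams), 1),
        "most_common_caption_share": Counter(captions).most_common(1)[0][1] / len(captions),
    }

if __name__ == "__main__":
    refs = [
        "a man is playing a guitar",
        "a man is playing a guitar",
        "a person plays an acoustic guitar on stage",
        "a man is playing a guitar",
    ]
    print(diversity_stats(refs))  # low diversity: one caption dominates
```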


VideoBERT: A Joint Model for Video and Language Representation Learning

arXiv.org Artificial Intelligence

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification.

Deep learning can benefit a lot from labeled data [23], but this is hard to acquire at scale. Consequently there has been a lot of recent interest in "self supervised learning", where we train a model on various "proxy tasks", which we hope will result in the discovery of features or representations that can be used in downstream tasks (see e.g., [22]). A wide variety of such proxy tasks have been proposed in the image and video domains. However, most of these methods focus on low level features (e.g., textures) and short temporal scales (e.g., motion patterns that last a second or less). We are interested in discovering high-level semantic features which correspond to actions and events that unfold over longer time scales.
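A rough sketch of how video can be turned into discrete "visual words" for a BERT-style joint sequence, assuming the general recipe in the abstract: cluster clip-level visual features, map each clip to its nearest centroid id, and concatenate those ids with the ASR text tokens. Plain k-means, the `[VID_*]` token names, and the sequence layout below are simplifying assumptions; feature extraction and the masked-modeling training loop are out of scope.

```python
# Sketch: quantize visual features into a token vocabulary and build a joint
# text + video token sequence (assumed layout, not the paper's exact one).
import numpy as np
from sklearn.cluster import KMeans

def fit_visual_vocab(features: np.ndarray, vocab_size: int = 64) -> KMeans:
    """Fit a k-means codebook over (num_clips, feature_dim) visual features."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(features)

def build_joint_sequence(clip_features: np.ndarray,
                         asr_tokens: list[str],
                         codebook: KMeans) -> list[str]:
    """Concatenate ASR text tokens with quantized 'visual word' tokens."""
    visual_ids = codebook.predict(clip_features)
    visual_tokens = [f"[VID_{i}]" for i in visual_ids]
    # One common layout: [CLS] text tokens [SEP] visual tokens [SEP]
    return ["[CLS]", *asr_tokens, "[SEP]", *visual_tokens, "[SEP]"]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = fit_visual_vocab(rng.normal(size=(500, 128)), vocab_size=16)
    seq = build_joint_sequence(rng.normal(size=(4, 128)),
                               "add the onions to the pan".split(),
                               codebook)
    print(seq)
```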