video clip
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (22 more...)
- Education > Curriculum > Subject-Specific Education (0.96)
- Health & Medicine (0.69)
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets.
Dynamic Normalization and Relay for Video Action Recognition
Convolutional Neural Networks (CNNs) have been the dominant model for video action recognition. Due to the huge memory and compute demand, popular action recognition networks need to be trained with small batch sizes, which makes learning discriminative spatial-temporal representations for videos become a challenging problem. In this paper, we present Dynamic Normalization and Relay (DNR), an improved normalization design, to augment the spatial-temporal representation learning of any deep action recognition model, adapting to small batch size training settings. We observe that state-of-the-art action recognition networks usually apply the same normalization parameters to all video data, and ignore the dependencies of the estimated normalization parameters between neighboring frames (at the same layer) and between neighboring layers (with all frames of a video clip). Inspired by this, DNR introduces two dynamic normalization relay modules to explore the potentials of cross-temporal and cross-layer feature distribution dependencies for estimating accurate layer-wise normalization parameters. These two DNR modules are instantiated as a light-weight recurrent structure conditioned on the current input features, and the normalization parameters estimated from the neighboring frames based features at the same layer or from the whole video clip based features at the preceding layers. We first plug DNR into prevailing 2D CNN backbones and test its performance on public action recognition datasets including Kinetics and Something-Something. Experimental results show that DNR brings large performance improvements to the baselines, achieving over 4.4% absolute margins in top-1 accuracy without training bells and whistles.
Supplementary Materials for V
In this appendix, we start with describing the experimental setup details (Sec. Each sampled video from 30K has on average around 100 clips. To investigate whether the additional MLP distillation head (Sec.3.3 in the main paper) affects the As we see in Table 1, for both NST and CRD, the performance drops on all downstream tasks when distillation heads are removed. Table 1: Ablation results of additional distillation heads for student language models.SST -2 QNLI QQP MNLI BERT In Table 2, we compare the accuracy of text-only pretraining, image-based KD and video-based KD on PIQA. KD further improves the results.
- North America > United States (0.04)
- Europe > United Kingdom (0.04)
- Europe > Germany (0.04)
- Asia > South Korea (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.05)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- North America > United States > Illinois (0.04)