
 video dataset



MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing Supplementary Material

Neural Information Processing Systems

VLM Evaluation. To evaluate two VLMs (Frozen in Time [1] and VideoCLIP [13]), we use a hybrid approach that leverages both prototypical networks [11] and the video-language similarity metrics learned by both models. Below, we show an ablation study in which we use only the video prototype networks. We also show the performance of using only language similarity in the few-shot case to demonstrate the effects of sample removal, as well as the effects of our hybrid weighting scheme, in which we weight the language embeddings five times more than the video embeddings when constructing the hybrid prototype (as opposed to the equal weighting used in the regular hybrid approach). Although we perform our ablation study with Frozen in Time, we use the same weighting scheme and prototype strategy for VideoCLIP as well. For this study, we show activity and sub-activity classification accuracy in the 5-shot case. We visualize whether a given method uses language, video, or both to create its prototype embeddings.
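The hybrid prototype idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hybrid_prototypes`, `classify`), the weighted-average combination rule, and the use of cosine similarity are assumptions made for illustration. The excerpt only states that language embeddings are weighted equally with the video embeddings in the regular hybrid approach and five times more in the weighted variant.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def hybrid_prototypes(support_video, support_labels, text_emb, language_weight=1.0):
    """Build one prototype per class from video and language embeddings.

    support_video:  (N, D) embeddings of the few-shot support videos
    support_labels: (N,) integer class ids in [0, C)
    text_emb:       (C, D) embedding of each class name or description
    language_weight: 1.0 for equal weighting, 5.0 for the weighted variant
                     (the weighted-average form itself is an assumption)
    Assumes every class has at least one support video.
    """
    support_video = l2_normalize(support_video)
    text_emb = l2_normalize(text_emb)
    num_classes, dim = text_emb.shape
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        # Video-only prototype: mean of the class's support embeddings.
        video_proto = support_video[support_labels == c].mean(axis=0)
        # Hybrid prototype: weighted average of video and language embeddings.
        protos[c] = (video_proto + language_weight * text_emb[c]) / (1.0 + language_weight)
    return l2_normalize(protos)


def classify(query_video, protos):
    """Assign each query video to the prototype with highest cosine similarity."""
    sims = l2_normalize(query_video) @ protos.T
    return sims.argmax(axis=1)


# Toy example: 2 classes, 3-dimensional embeddings, 2 support videos per class.
support = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
                    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
labels = np.array([0, 0, 1, 1])
text = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
queries = np.array([[0.8, 0.2, 0.0], [0.2, 0.8, 0.0]])

equal_protos = hybrid_prototypes(support, labels, text, language_weight=1.0)
weighted_protos = hybrid_prototypes(support, labels, text, language_weight=5.0)
predictions = classify(queries, weighted_protos)
```

In this sketch, a very small `language_weight` approximates the video-only ablation, and using `text_emb` directly as the prototypes corresponds to the language-only setting.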









VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Neural Information Processing Systems

The Transformer [70] has brought significant progress in natural language processing [17,7,54]. The vision transformer [20] has likewise improved a series of computer vision tasks, including image classification [66,88], object detection [8,37], semantic segmentation [80], object tracking [13,16], and video recognition [6,3].