Goto

Collaborating Authors

 space-time attention







A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Papalampidi, Pinelopi, Koppula, Skanda, Pathak, Shreya, Chiu, Justin, Heyward, Joe, Patraucean, Viorica, Shen, Jiajun, Miech, Antoine, Zisserman, Andrew, Nematzdeh, Aida

arXiv.org Artificial Intelligence

Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).


Space-Time Attention with Shifted Non-Local Search

Gauen, Kent, Chan, Stanley

arXiv.org Artificial Intelligence

Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less memory and is over 3 times faster than previous work. Experimentally, correcting the small spatial errors improves the video frame alignment quality by over 3 dB PSNR. Our search upgrades existing space-time attention modules, which improves video denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We integrate our space-time attention module into a UNet-like architecture to achieve state-of-the-art results on video denoising.


Space-time Mixing Attention for Video Transformer

Bulat, Adrian, Perez-Rua, Juan-Manuel, Sudhakaran, Swathikiran, Martinez, Brais, Tzimiropoulos, Georgios

arXiv.org Artificial Intelligence

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code will be made available.


Facebook AI Introduces TimeSformer: A New Video Architecture Based Purely On Transformers

#artificialintelligence

Facebook AI has built a new architecture for video understanding called TimeSformer. The video architecture is purely based on Transformers. Transformers have become the dominant approach for many natural language processing (NLP) applications such as Machine Translation and General language understanding. TimeSformer was proven to achieve the best-reported numbers on multiple challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Compared with modern 3D convolutional neural networks, it is nearly three times faster to train requires less than one-tenth of computing inference.