AITopics | space-time attention

Collaborating Authors

space-time attention

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Space-time Mixing Attention for Video Transformer

Neural Information Processing Systems, Feb-10-2026, 09:59:44 GMT

This paper is on video recognition using Transformers.

artificial intelligence, arxiv preprint arxiv, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Supplementary material for Space-time Mixing Attention for Video Transformer

Neural Information Processing Systems, Aug-16-2025, 13:21:12 GMT

The results of Table 2 clearly show that the two approaches are different.

aggregation, artificial intelligence, space-time mixing attention, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.55)

Space-time Mixing Attention for Video Transformer

Neural Information Processing Systems, Aug-16-2025, 13:21:08 GMT

This paper is on video recognition using Transformers.

arxiv preprint arxiv, machine learning, space-time attention, (17 more...)

Neural Information Processing Systems

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Patrick, Mandela

Neural Information Processing Systems, Aug-14-2025, 22:38:59 GMT

The self-attention mechanism in the transformer works well for different types of data and across domains.

prototype, trajectory attention, transformer, (15 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
(2 more...)

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Papalampidi, Pinelopi, Koppula, Skanda, Pathak, Shreya, Chiu, Justin, Heyward, Joe, Patraucean, Viorica, Shen, Jiajun, Miech, Antoine, Zisserman, Andrew, Nematzadeh, Aida

arXiv.org Artificial Intelligence, Dec-12-2023

Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
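The input-masking idea in this abstract can be illustrated with a minimal sketch. It assumes uniform random masking of patch tokens, which the abstract does not specify; the function name `mask_video_tokens` and the 16-frame, 14x14-patch clip are hypothetical choices for illustration only, not the authors' setup.

```python
import random


def mask_video_tokens(num_tokens, mask_ratio=0.75, seed=0):
    """Return the sorted indices of tokens kept after random masking.

    Assumption: tokens are dropped uniformly at random. The abstract only
    says that large portions of the video (up to 75%) are masked.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


# Hypothetical clip: 16 frames, each split into 14 x 14 patches.
frames, patches_per_frame = 16, 14 * 14
total_tokens = frames * patches_per_frame

kept = mask_video_tokens(total_tokens, mask_ratio=0.75)

# Full self-attention compares every kept token with every other kept token,
# so the number of comparisons shrinks with the square of the kept fraction.
cost_ratio = (len(kept) / total_tokens) ** 2

print(len(kept), "of", total_tokens, "tokens kept")
print("relative attention cost:", cost_ratio)
```

Under these assumptions, keeping a quarter of the tokens cuts the number of pairwise attention comparisons to one sixteenth, which suggests why masking eases the memory bottleneck the abstract describes.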

benchmark, encoder, video, (13 more...)

arXiv.org Artificial Intelligence

2312.07395

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Space-Time Attention with Shifted Non-Local Search

Gauen, Kent, Chan, Stanley

arXiv.org Artificial Intelligence, Dec-4-2023

Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less memory and is over 3 times faster than previous work. Experimentally, correcting the small spatial errors improves the video frame alignment quality by over 3 dB PSNR. Our search upgrades existing space-time attention modules, which improves video denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We integrate our space-time attention module into a UNet-like architecture to achieve state-of-the-art results on video denoising.
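The search strategy described in this abstract, a small grid search around a predicted offset, can be sketched as follows. This is a simplified illustration, not the authors' implementation: it matches single grayscale pixels by squared difference instead of computing attention over patches, and the function name `shifted_search` and the toy frames are hypothetical.

```python
def shifted_search(query_frame, key_frame, qy, qx, offset, radius=1):
    """Search a (2*radius+1)^2 grid centred on the predicted offset.

    Returns (distance, ky, kx) for the best-matching key location, or None
    if the whole search window falls outside the key frame.
    Assumption: similarity is squared difference between single pixels.
    """
    height, width = len(key_frame), len(key_frame[0])
    query_value = query_frame[qy][qx]
    centre_y, centre_x = qy + offset[0], qx + offset[1]
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ky, kx = centre_y + dy, centre_x + dx
            if 0 <= ky < height and 0 <= kx < width:
                dist = (key_frame[ky][kx] - query_value) ** 2
                if best is None or dist < best[0]:
                    best = (dist, ky, kx)
    return best


# Toy frames: a bright pixel moves from (1, 1) to (3, 4), a true shift of (2, 3).
query = [[0] * 6 for _ in range(5)]
key = [[0] * 6 for _ in range(5)]
query[1][1] = 9
key[3][4] = 9

predicted = (2, 2)  # hypothetical predicted offset, off by one pixel

uncorrected = shifted_search(query, key, 1, 1, predicted, radius=0)
corrected = shifted_search(query, key, 1, 1, predicted, radius=1)
corrected_offset = (corrected[1] - 1, corrected[2] - 1)

print("offset only:", uncorrected)
print("with grid search:", corrected, "offset", corrected_offset)
```

In this toy case the predicted offset alone lands on a non-matching pixel, while the radius-1 grid search around it recovers the true shift of (2, 3).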

attention module, shifted non-local search, video, (12 more...)

arXiv.org Artificial Intelligence

2309.16849

Country: Europe > Sweden > Halland County > Halmstad (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)

Space-time Mixing Attention for Video Transformer

Bulat, Adrian, Perez-Rua, Juan-Manuel, Sudhakaran, Swathikiran, Martinez, Brais, Tzimiropoulos, Georgios

arXiv.org Artificial Intelligence, Jun-11-2021

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code will be made available.
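The cost argument in this abstract can be made concrete with a back-of-the-envelope count. The sketch below is not the model itself: it only counts query-key comparisons per attention layer, taking from the abstract that the proposed attention costs no more than spatial-only attention and that each layer attends within a local temporal window. The function names, the window half-width `tw`, and the 8-frame, 196-patch example are assumptions for illustration.

```python
import math


def pairs_full_space_time(num_frames, patches_per_frame):
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens


def pairs_spatial_only(num_frames, patches_per_frame):
    """Query-key pairs when each token attends only within its own frame.

    Per the abstract, space-time mixing adds no cost on top of this.
    """
    return num_frames * patches_per_frame * patches_per_frame


def layers_for_full_coverage(num_frames, tw=1):
    """Layers needed before the first frame can receive information from the last.

    Assumption: each layer mixes frames within +/- tw of the current frame,
    so the temporal reach grows by tw frames per layer.
    """
    return math.ceil((num_frames - 1) / tw)


T, S = 8, 196  # hypothetical: 8 frames, 14 x 14 patches per frame

full = pairs_full_space_time(T, S)
spatial = pairs_spatial_only(T, S)
depth_needed = layers_for_full_coverage(T, tw=1)

print("full space-time pairs:", full)
print("spatial-only pairs:   ", spatial)
print("ratio:", full // spatial)
print("layers for full temporal coverage:", depth_needed)
```

Under these assumptions the full space-time count exceeds the spatial-only count by a factor equal to the number of frames, and a network deeper than the printed layer count would let every frame draw on the whole clip despite the local window.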

arxiv preprint arxiv, recognition, space-time attention, (13 more...)

arXiv.org Artificial Intelligence

2106.05968

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Facebook AI Introduces TimeSformer: A New Video Architecture Based Purely On Transformers

#artificialintelligence, Mar-17-2021, 12:53:00 GMT

Facebook AI has built a new architecture for video understanding called TimeSformer. The video architecture is purely based on Transformers. Transformers have become the dominant approach for many natural language processing (NLP) applications such as machine translation and general language understanding. TimeSformer is reported to achieve the best numbers on multiple challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Compared with modern 3D convolutional neural networks, it is nearly three times faster to train and requires less than one-tenth the compute for inference.

timesformer, transformer, video, (8 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.57)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.39)
