videomae
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Transformer [70] has brought significant progress in natural language processing [17,7,54]. The vision transformer [20] has also improved a range of computer vision tasks, including image classification [66,88], object detection [8,37], semantic segmentation [80], object tracking [13,16], and video recognition [6,3].
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, we propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging and meaningful self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings with VideoMAE: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images.
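The tube masking described in this abstract can be pictured with a short sketch: sample one random spatial mask per clip and repeat it across all frames, so a masked patch stays masked for the whole clip. This is a minimal illustration rather than the paper's implementation; the shapes and mask ratio below are placeholder values.

```python
# Minimal sketch of tube masking: one spatial mask, repeated over time.
import numpy as np

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9) -> np.ndarray:
    """Return a boolean mask of shape (num_frames, num_patches); True = masked."""
    num_masked = int(num_patches * mask_ratio)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[np.random.choice(num_patches, num_masked, replace=False)] = True
    # Repeat the same spatial mask across all frames to form "tubes".
    return np.tile(spatial_mask, (num_frames, 1))

# Example: 8 temporal token groups, 14x14 spatial patches, 90% masking.
mask = tube_mask(num_frames=8, num_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (8, 196), ~0.9
```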
Appendix
In this appendix, we provide more details of VideoMAE from the following aspects: the detailed architecture illustration is in A, the implementation details are in B, results analysis is in D, visualization of reconstructed samples is in E, and the license of the datasets is in F. We take the 16-frame vanilla ViT-Base as an example. We conduct the experiments with 64 GPUs for both pre-training and fine-tuning on the Something-Something V2 and Kinetics-400 datasets. The experiments on the AVA dataset are conducted with 32 GPUs. For evaluation, all models share the same inference protocol, i.e., 2 clips. Our VideoMAE is pre-trained for 800 epochs on Kinetics-400 by default. We follow a similar recipe on Kinetics for pre-training.
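For quick reference, the setup stated above can be gathered into a small summary; only values mentioned in the text are included, and nothing else is guessed.

```python
# Experimental setup stated in the appendix excerpt, as a reference dict.
videomae_setup = {
    "backbone": "vanilla ViT-Base (16 frames)",
    "pretrain_epochs_kinetics400": 800,   # default pre-training length
    "gpus_ssv2_and_k400": 64,             # pre-training and fine-tuning
    "gpus_ava": 32,                       # AVA experiments
}
print(videomae_setup)
```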
TRecViT: A Recurrent Video Transformer
Pătrăucean, Viorica, He, Xu Owen, Heyward, Joseph, Zhang, Chuhan, Sajjadi, Mehdi S. M., Muraru, George-Cristian, Zholus, Artem, Karami, Mahdi, Goroshin, Ross, Chen, Yutian, Osindero, Simon, Carreira, João, Pascanu, Razvan
We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs mix over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model, ViViT-L, on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
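A rough sketch of the time-space-channel factorisation described in this abstract is given below, assuming a (batch, time, tokens, channels) layout. The gated LRU is replaced with a GRU purely as an illustrative stand-in, so this is not the TRecViT block itself; see the paper and repository for the actual design.

```python
# Sketch of a factorised video block: recurrence over time, attention over
# space, MLP over channels (GRU used as a stand-in for the gated LRU).
import torch
import torch.nn as nn

class FactorisedVideoBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.time_mix = nn.GRU(dim, dim, batch_first=True)                    # mixing over time
        self.space_mix = nn.MultiheadAttention(dim, heads, batch_first=True)  # mixing over space
        self.channel_mix = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim)
        b, t, n, d = x.shape
        # 1) Temporal mixing: run the recurrence independently for each spatial token.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt = xt + self.time_mix(self.norm1(xt))[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # 2) Spatial mixing: self-attention over tokens within each frame.
        xs = x.reshape(b * t, n, d)
        h = self.norm2(xs)
        xs = xs + self.space_mix(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, n, d)
        # 3) Channel mixing: per-token MLP.
        return x + self.channel_mix(self.norm3(x))

block = FactorisedVideoBlock(dim=256)
out = block(torch.randn(2, 8, 196, 256))
print(out.shape)  # torch.Size([2, 8, 196, 256])
```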
When Spatial meets Temporal in Action Recognition
Chen, Huilin, Wang, Lei, Chen, Yifan, Gedeon, Tom, Koniusz, Piotr
Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion), they rarely address the need for a comprehensive integration of both. Capturing the rich temporal evolution of video frames, while preserving their spatial details, is crucial for improving accuracy. In this paper, we introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information. The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding $N^2$ temporally evolving frames into a single spatial grid of size $N \times N$. This transformation creates new frames that balance both spatial and temporal information, making them compatible with existing video models. When $N=1$, the layer captures rich spatial details, similar to existing methods. As $N$ increases ($N\geq2$), temporal information becomes more prominent, while the spatial information decreases to ensure compatibility with model inputs. We demonstrate the effectiveness of the TIME layer by integrating it into popular action recognition models, such as ResNet-50, Vision Transformer, and Video Masked Autoencoders, for both RGB and depth video data. Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
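The frame rearrangement described in this abstract can be illustrated with a short sketch: $N^2$ consecutive frames are tiled row by row into one $N \times N$ grid image, which can then be resized back to the backbone's input resolution. The exact tiling order here is an assumption for illustration, not the authors' definition of the TIME layer.

```python
# Sketch of embedding N^2 frames into a single N x N spatial grid.
import torch

def frames_to_grid(frames: torch.Tensor, n: int) -> torch.Tensor:
    """frames: (N*N, C, H, W) -> one grid image of shape (C, N*H, N*W)."""
    assert frames.shape[0] == n * n, "expects exactly N^2 frames"
    # Concatenate n frames along width to form each row, then stack rows along height.
    rows = [torch.cat(list(frames[r * n:(r + 1) * n]), dim=-1) for r in range(n)]
    return torch.cat(rows, dim=-2)

# Example: N=2 embeds 4 frames into one 2x2 grid; resizing the grid to the
# model's input size keeps it compatible with an unmodified backbone.
grid = frames_to_grid(torch.randn(4, 3, 112, 112), n=2)
print(grid.shape)  # torch.Size([3, 224, 224])
```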
Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment
Gupta, Arushi, Kocielnik, Rafal, Wang, Jiayun, Nasriddinov, Firdavs, Yang, Cherine, Wong, Elyssa, Anandkumar, Anima, Hung, Andrew
During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods for improving surgical training and education. However, relying on human annotations to assess feedback effectiveness is laborious and prone to biases, underscoring the need for an automated, scalable, and objective method. Creating such an automated system poses challenges, as it requires an understanding of both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness. Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes, and their combination achieves an AUROC of 0.70 ± 0.02, improving prediction accuracy by up to 6.6%. Additionally, we introduce self-supervised fine-tuning as a strategy for enhancing surgical video representation learning, which is scalable and further enhances prediction performance. Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.
- North America > United States > California (0.14)
- Europe > Spain > Aragón (0.04)
- Europe > Czechia > South Moravian Region > Brno (0.04)
- Health & Medicine > Surgery (1.00)
- Education > Curriculum > Subject-Specific Education (0.75)
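As a sketch of the multi-modal prediction described in the last abstract: a text embedding of the transcribed feedback and a video embedding of the surgical clip are concatenated and passed to a small head that predicts whether the feedback leads to a behaviour change. The module name, embedding dimensions, and fusion strategy below are assumptions for illustration, not the authors' architecture.

```python
# Sketch of late fusion of text and video embeddings for a binary
# "behaviour change" prediction (illustrative only).
import torch
import torch.nn as nn

class FeedbackEffectivenessHead(nn.Module):
    def __init__(self, text_dim: int = 768, video_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "behaviour change" vs. "no change"
        )

    def forward(self, text_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([text_emb, video_emb], dim=-1))

head = FeedbackEffectivenessHead()
logit = head(torch.randn(4, 768), torch.randn(4, 768))
print(torch.sigmoid(logit).shape)  # torch.Size([4, 1]) probabilities
```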