AITopics | video action recognition

Compressed videos offer a compelling alternative to raw videos, with the potential to significantly reduce online computational and storage costs.

artificial intelligence, machine learning, natural language, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

5bd529d5b07b647a8863cf71e98d651a-Paper.pdf

Neural Information Processing Systems, Feb-8-2026, 20:53:22 GMT

action recognition, normalization parameter, recognition, (14 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

3776558654d8db1bfcb9ebde0e01184e-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 07:56:38 GMT

We thus add more parameters in the head network and see if this could close the gap. As UPerNet has an FPN-like head network, we add parameters by replacing FPN with BiFPN. From this figure, we can observe that the features across heads in the Transformer decoder are almost the same. Semantic Segmentation on ADE20K: For the semantic segmentation task, we adopt the widely used ADE20K [11] as the benchmark. Table 7: Hyperparameters for the frozen setting and full finetuning on Kinetics-400 video action recognition.

artificial intelligence, batch frozen fullft, frozen fullft, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.52)

3776558654d8db1bfcb9ebde0e01184e-Paper-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 07:56:35 GMT

action recognition, arxiv preprint arxiv, recognition, (15 more...)

Neural Information Processing Systems

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems, Feb-7-2026, 23:04:40 GMT

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.

machine learning, natural language, wang, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

CAST: Cross-Attention in Space and Time for Video Action Recognition

Neural Information Processing Systems, Dec-27-2025, 06:39:25 GMT

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-Kitchens-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.
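As a rough illustration of the information exchange described above, the sketch below lets two token streams read from each other with plain scaled dot-product cross-attention. It is not the CAST implementation: the bottleneck projections and all learned weights are omitted, and the names `cross_attend` and `exchange` are hypothetical.

```python
import math


def cross_attend(queries, context):
    """Scaled dot-product cross-attention with no learned projections.

    Each query token returns a softmax-weighted average of the context tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) / total for j in range(dim)]
        )
    return outputs


def exchange(spatial_tokens, temporal_tokens):
    """Hypothetical two-way exchange: each stream adds what it reads from the other."""
    from_temporal = cross_attend(spatial_tokens, temporal_tokens)
    from_spatial = cross_attend(temporal_tokens, spatial_tokens)
    new_spatial = [
        [x + r for x, r in zip(tok, read)]
        for tok, read in zip(spatial_tokens, from_temporal)
    ]
    new_temporal = [
        [x + r for x, r in zip(tok, read)]
        for tok, read in zip(temporal_tokens, from_spatial)
    ]
    return new_spatial, new_temporal


# Toy inputs: 2 spatial tokens and 3 temporal tokens, each of dimension 2.
spatial = [[1.0, 0.0], [0.0, 1.0]]
temporal = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
new_spatial, new_temporal = exchange(spatial, temporal)
```

Each stream keeps its own token count after the exchange; only the content of the tokens changes, which is what allows two separately trained experts to stay structurally intact while sharing information.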

name change, space and time, video action recognition, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.45)

Alignment-guided Temporal Attention for Video Action Recognition

Neural Information Processing Systems, Dec-24-2025, 06:13:06 GMT

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
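The efficiency gap between factorized and joint attention mentioned above can be made concrete by counting query-key pairs per attention layer. The sketch below is a back-of-the-envelope calculation, not code from the paper; the function name and the example sizes (8 frames, 196 patches per frame) are assumptions chosen for illustration.

```python
def attention_pair_counts(num_frames, patches_per_frame):
    """Count query-key pairs scored by one single-head attention layer."""
    t, n = num_frames, patches_per_frame
    joint = (t * n) ** 2   # joint (3D): every token attends to every token
    spatial = t * n * n    # factorized, 2D part: attention within each frame
    temporal = n * t * t   # factorized, 1D part: same patch index across frames
    return {"joint": joint, "factorized": spatial + temporal}


counts = attention_pair_counts(num_frames=8, patches_per_frame=196)
print(counts)  # {'joint': 2458624, 'factorized': 319872}
```

With these sizes, joint attention scores roughly 7.7 times as many pairs as the factorized variant, but the factorized temporal step only ever compares a patch with the same patch index in other frames. That restriction is the limited interaction the abstract refers to.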

alignment-guided temporal attention, name change, video action recognition, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Dynamic Normalization and Relay for Video Action Recognition

Neural Information Processing Systems, Dec-24-2025, 04:18:29 GMT

Convolutional Neural Networks (CNNs) have been the dominant model for video action recognition. Due to the huge memory and compute demand, popular action recognition networks need to be trained with small batch sizes, which makes learning discriminative spatial-temporal representations for videos challenging. In this paper, we present Dynamic Normalization and Relay (DNR), an improved normalization design, to augment the spatial-temporal representation learning of any deep action recognition model, adapting to small batch size training settings. We observe that state-of-the-art action recognition networks usually apply the same normalization parameters to all video data, and ignore the dependencies of the estimated normalization parameters between neighboring frames (at the same layer) and between neighboring layers (with all frames of a video clip). Inspired by this, DNR introduces two dynamic normalization relay modules to explore the potential of cross-temporal and cross-layer feature distribution dependencies for estimating accurate layer-wise normalization parameters. These two DNR modules are instantiated as a light-weight recurrent structure conditioned on the current input features and on the normalization parameters estimated either from neighboring-frame features at the same layer or from whole-clip features at the preceding layers. We first plug DNR into prevailing 2D CNN backbones and test its performance on public action recognition datasets including Kinetics and Something-Something. Experimental results show that DNR brings large performance improvements to the baselines, achieving absolute margins of over 4.4% in top-1 accuracy without training bells and whistles.
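The idea of relaying normalization statistics from one frame to the next can be illustrated with a toy sketch. This is not the DNR module: the abstract describes a light-weight recurrent structure conditioned on input features, whereas the fixed `gate` blend, the single-channel input, and the function name `relay_normalize` below are assumptions made purely for illustration.

```python
from statistics import fmean, pvariance


def relay_normalize(frames, gate=0.5, eps=1e-5):
    """Normalize each frame's features with statistics relayed across time.

    frames: list of per-frame feature lists (a single channel, for brevity).
    gate:   hypothetical fixed blend weight; a learned recurrent unit would
            compute this from the input features in a real model.
    """
    normalized, stats = [], []
    prev_mean = prev_var = None
    for feats in frames:
        mean, var = fmean(feats), pvariance(feats)
        if prev_mean is not None:
            # Relay step: mix the previous frame's estimate into the current one.
            mean = gate * prev_mean + (1.0 - gate) * mean
            var = gate * prev_var + (1.0 - gate) * var
        normalized.append([(x - mean) / (var + eps) ** 0.5 for x in feats])
        stats.append((mean, var))
        prev_mean, prev_var = mean, var
    return normalized, stats


frames = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
normalized, stats = relay_normalize(frames)
print(stats[1][0])  # 3.0: halfway between the frame means 2.0 and 4.0
```

Setting `gate=0.0` recovers ordinary per-frame normalization, so the blend weight controls how strongly each frame's estimate depends on its predecessor.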

dynamic normalization and relay, name change, normalization parameter, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.59)

cb3213ada48302953cb0f166464ab356-Paper.pdf

Revealing the unseen: Benchmarking video action recognition under occlusion

Compressed Video Prompt Tuning, Bing Li, Jiaxin Chen
