Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Lin, Tingyu, Dadras, Armin, Kleber, Florian, Sablatnig, Robert

arXiv.org Artificial Intelligence

Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.



Enhancing Video-Based Robot Failure Detection Using Task Knowledge

Thoduka, Santosh, Houben, Sebastian, Gall, Juergen, Plöger, Paul G.

arXiv.org Artificial Intelligence

Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets, which we amend in part with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense, and a further increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.
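The augmentation the abstract describes, applying variable frame rates to different parts of a video, can be sketched as temporal resampling per segment. This is a minimal illustration, not the authors' implementation; the function name, segment boundaries, and rate factors are assumptions:

```python
import numpy as np

def variable_frame_rate(video, boundaries, rates, out_len=None):
    """Resample different temporal parts of a video at different rates.

    video: (T, H, W, C) array of frames.
    boundaries: indices splitting the video into segments.
    rates: speed factor per segment (>1 = faster playback, fewer frames).
    out_len: optionally resample the result back to a fixed frame count.
    """
    segments = np.split(video, boundaries)
    resampled = []
    for seg, r in zip(segments, rates):
        n = max(1, int(round(len(seg) / r)))
        # Pick n frame indices spread evenly over the segment.
        idx = np.linspace(0, len(seg) - 1, n).round().astype(int)
        resampled.append(seg[idx])
    out = np.concatenate(resampled)
    if out_len is not None:
        idx = np.linspace(0, len(out) - 1, out_len).round().astype(int)
        out = out[idx]
    return out

# Example: speed up the first half of a 16-frame clip by 2x,
# leave the second half untouched, then pad back to 16 frames.
clip = np.zeros((16, 32, 32, 3))
augmented = variable_frame_rate(clip, boundaries=[8], rates=[2.0, 1.0], out_len=16)
```

In training, the boundaries would plausibly align with the annotated robot actions, so that some actions appear sped up relative to others.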


Adversarial Attacks on Black Box Video Classifiers: Leveraging the Power of Geometric Transformations

Neural Information Processing Systems

Compared to image classification models, black-box adversarial attacks against video classification models remain largely understudied. This is likely because, with video, the temporal dimension poses significant additional challenges for gradient estimation. Query-efficient black-box attacks rely on effectively estimated gradients to maximize the probability of misclassifying the target video. In this work, we demonstrate that such effective gradients can be searched for by parameterizing the temporal structure of the search space with geometric transformations. GEO-TRAP employs standard geometric transformation operations to reduce the search for effective gradients to a search over the small group of parameters that define these operations.
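The core idea, estimating gradients over a handful of transformation parameters rather than over every pixel of every frame, can be sketched with NES-style finite differences. This is a toy illustration under stated assumptions, not the GEO-TRAP algorithm: the base pattern, the two-parameter per-frame shift, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_perturbation(theta, shape):
    """Build a (T, H, W) perturbation whose temporal structure is
    controlled by just two parameters theta = (dx, dy): a fixed base
    pattern shifted progressively across frames (hypothetical choice)."""
    T, H, W = shape
    base = np.sin(np.linspace(0, 2 * np.pi, H))[:, None] * np.ones((H, W))
    dx, dy = theta
    out = np.empty(shape)
    for t in range(T):
        out[t] = np.roll(np.roll(base, int(round(dx * t)), axis=0),
                         int(round(dy * t)), axis=1)
    return out

def estimate_gradient(loss_fn, theta, shape, sigma=0.1, n=10):
    """Antithetic finite-difference gradient estimate over the
    low-dimensional parameters; loss_fn is the black-box query."""
    grad = np.zeros_like(theta)
    for _ in range(n):
        u = rng.standard_normal(theta.shape)
        lp = loss_fn(make_perturbation(theta + sigma * u, shape))
        lm = loss_fn(make_perturbation(theta - sigma * u, shape))
        grad += (lp - lm) / (2 * sigma) * u
    return grad / n
```

The point of the reduction: each query to the victim model still evaluates a full video-sized perturbation, but the search happens in a 2-dimensional parameter space instead of a T x H x W one, which is what makes the attack query-efficient.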


Towards Gradient-based Time-Series Explanations through a SpatioTemporal Attention Network

Lee, Min Hun

arXiv.org Artificial Intelligence

However, it is not desirable to apply AI fully autonomously, as wrong outcomes of AI models in high-stakes domains could have serious impacts on people. Regardless of the performance of an AI model, end-users desire to understand the evidence behind its outcomes [35]. A growing body of research investigates how to generate explanations of an AI model and augment users' decision-making tasks [2, 18, 25]. Researchers have explored various techniques to make AI interpretable and explainable [15]. These explainable AI techniques can be broadly categorized into inherently interpretable models (e.g.


MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification

Liu, Rex, Zhang, Huanle, Pirsiavash, Hamed, Liu, Xin

arXiv.org Artificial Intelligence

We propose MASTAF, a Model-Agnostic Spatio-Temporal Attention Fusion network for few-shot video classification. MASTAF takes as input a general video spatial and temporal representation, e.g., from a 2D CNN, 3D CNN, or Video Transformer. Then, to make the most of such representations, we use self- and cross-attention models to highlight the critical spatio-temporal regions, increasing inter-class variation and decreasing intra-class variation. Last, MASTAF applies a lightweight fusion network and a nearest neighbor classifier to classify each query video. We demonstrate that MASTAF improves the state-of-the-art performance on three few-shot video classification benchmarks (UCF101, HMDB51, and Something-Something-V2), achieving 91.6%, 69.5%, and 60.7% for five-way one-shot video classification, respectively.
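The pipeline the abstract outlines, attention over spatio-temporal features followed by nearest-neighbor classification against class prototypes, can be sketched in a few lines of NumPy. This is a simplified sketch, not the MASTAF architecture: it keeps only self-attention pooling and cosine-similarity matching, and all names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(feats):
    """Pool per-video features (T, D) into one embedding (D,) via
    scaled dot-product self-attention followed by mean pooling."""
    attn = softmax(feats @ feats.T / np.sqrt(feats.shape[1]))
    return (attn @ feats).mean(axis=0)

def nn_classify(query, support, labels):
    """Classify a query video by cosine similarity of its pooled
    embedding to the mean embedding (prototype) of each support class."""
    q = self_attend(query)
    protos = {}
    for f, y in zip(support, labels):
        protos.setdefault(y, []).append(self_attend(f))
    best, best_sim = None, -np.inf
    for y, vs in protos.items():
        p = np.mean(vs, axis=0)
        sim = q @ p / (np.linalg.norm(q) * np.linalg.norm(p) + 1e-8)
        if sim > best_sim:
            best, best_sim = y, sim
    return best
```

A full implementation would add cross-attention between query and support features and a learned fusion network before the nearest-neighbor step; the sketch only shows why no per-episode classifier training is needed.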