AITopics | timesformer


Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis

Hayakawa, Nirai, Shimari, Kazumasa, Yamasaki, Kazuma, Hoshikawa, Hirotatsu, Tsuchida, Rikuto, Matsumoto, Kenichi

arXiv.org Artificial Intelligence, Oct-21-2025

Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.

artificial intelligence, information, machine learning, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/CoG64752.2025.11114177

2510.17199

Country: Asia > Japan (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.95)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)


Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning

Garcia, Kathy, Isik, Leyla

arXiv.org Artificial Intelligence, Oct-3-2025

Humans intuitively perceive complex social signals in visual scenes, yet it remains unclear whether state-of-the-art AI models encode the same similarity structure. We study (Q1) whether modern video and language models capture human-perceived similarity in social videos, and (Q2) how to instill this structure into models using human behavioral data. To address this, we introduce a new benchmark of over 49,000 odd-one-out similarity judgments on 250 three-second video clips of social interactions, and discover a modality gap: despite the task being visual, caption-based language embeddings align better with human similarity than any pretrained video model. We close this gap by fine-tuning a TimeSformer video model on these human judgments with our novel hybrid triplet-RSA objective using low-rank adaptation (LoRA), aligning pairwise distances to human similarity. This fine-tuning protocol yields significantly improved alignment with human perceptions on held-out videos in terms of both explained variance and odd-one-out triplet accuracy. Variance partitioning shows that the fine-tuned video model increases shared variance with language embeddings and explains additional unique variance not captured by the language model. Finally, we test transfer via linear probes and find that human-similarity fine-tuning strengthens the encoding of social-affective attributes (intimacy, valence, dominance, communication) relative to the pretrained baseline. Overall, our findings highlight a gap in pretrained video models' social recognition and demonstrate that behavior-guided fine-tuning shapes video representations toward human social perception.
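The odd-one-out triplet accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that a model's choice for a triplet is the clip left out of the most cosine-similar pair of embeddings; the function names and toy data below are hypothetical and are not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_odd_one_out(embeddings, triplet):
    """Return the item of `triplet` that is left out of the most similar pair."""
    i, j, k = triplet
    # Pair each candidate odd item with the similarity of the other two items.
    candidates = [
        (cosine(embeddings[j], embeddings[k]), i),
        (cosine(embeddings[i], embeddings[k]), j),
        (cosine(embeddings[i], embeddings[j]), k),
    ]
    return max(candidates, key=lambda c: c[0])[1]

def triplet_accuracy(embeddings, judgments):
    """Fraction of (triplet, human_choice) pairs that the embeddings reproduce."""
    hits = sum(predict_odd_one_out(embeddings, t) == choice for t, choice in judgments)
    return hits / len(judgments)

# Hypothetical toy data: clips 0 and 1 are alike, clips 2 and 3 are alike.
toy_embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
# Each entry is (triplet of clip ids, clip a human picked as the odd one out).
toy_judgments = [((0, 1, 2), 2), ((0, 2, 3), 0), ((1, 2, 3), 2)]

accuracy = triplet_accuracy(toy_embeddings, toy_judgments)
```

In this toy setup the embeddings agree with two of the three human choices; the third triplet is deliberately labeled against the embedding geometry to show a miss.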

large language model, machine learning, urlhttp, (19 more...)

arXiv.org Artificial Intelligence

2510.01502

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.89)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

Khan, Md. Rashid Shahriar, Hasan, Md. Abrar, Justice, Mohammod Tareq Aziz

arXiv.org Artificial Intelligence, Aug-28-2025

Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.
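For readers unfamiliar with the InfoNCE loss named in the abstract, the sketch below shows its general form for a single anchor: the negative log-softmax of the positive pair's similarity over all candidates. The temperature value and the function name are assumptions made for illustration; this is not taken from the linked repository.

```python
import math

def info_nce(similarities, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor.

    `similarities` holds the anchor's similarity to every candidate;
    `positive_index` marks the matching (positive) candidate.
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    # -log( exp(pos) / sum(exp(all)) ) = log_denominator - pos
    return log_denominator - logits[positive_index]

# Hypothetical similarities: the positive is candidate 0 in both cases.
aligned_loss = info_nce([0.9, 0.1, 0.1, 0.1], positive_index=0)
misaligned_loss = info_nce([0.1, 0.9, 0.1, 0.1], positive_index=0)
```

The loss is small when the positive candidate is the most similar one and grows when a negative candidate outranks it, which is what pushes matched representations together during training.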

data mining, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.18463

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)


Three-dimensional attention Transformer for state evaluation in real-time strategy games

Ye, Yanqing, Yang, Weilong, Qiu, Kai, Zhang, Jie

arXiv.org Artificial Intelligence, Jan-7-2025

Situation assessment in Real-Time Strategy (RTS) games is crucial for understanding decision-making in complex adversarial environments. However, existing methods remain limited in processing multi-dimensional feature information and temporal dependencies. Here we propose a tri-dimensional Space-Time-Feature Transformer (TSTF Transformer) architecture, which efficiently models battlefield situations through three independent but cascaded modules: spatial attention, temporal attention, and feature attention. On a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF Transformer demonstrates superior performance: it achieves 58.7% accuracy in the early game (~4% progress), significantly outperforming the conventional TimeSformer's 41.8%, and reaches 97.6% accuracy in the mid-game (~40% progress) while maintaining low performance variation (standard deviation 0.114). Meanwhile, this architecture requires fewer parameters (4.75M) than the baseline model (5.54M). Our study not only provides new insights into situation assessment in RTS games but also presents an innovative paradigm for Transformer-based multi-dimensional temporal modeling.

accuracy, transformer, tstf transformer, (14 more...)

arXiv.org Artificial Intelligence

2501.03832

Country:

Asia > China > Beijing > Beijing (0.05)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)


TikGuard: A Deep Learning Transformer-Based Solution for Detecting Unsuitable TikTok Content for Kids

Balat, Mazen, Gabr, Mahmoud Essam, Bakr, Hend, Zaky, Ahmed B.

arXiv.org Artificial Intelligence, Oct-1-2024

The rise of short-form videos on platforms like TikTok has brought new challenges in safeguarding young viewers from inappropriate content. Traditional moderation methods often fall short in handling the vast and rapidly changing landscape of user-generated videos, increasing the risk of children encountering harmful material. This paper introduces TikGuard, a transformer-based deep learning approach aimed at detecting and flagging content unsuitable for children on TikTok. By using a specially curated dataset, TikHarm, and leveraging advanced video classification techniques, TikGuard achieves an accuracy of 86.7%, showing a notable improvement over existing methods in similar contexts. While direct comparisons are limited by the uniqueness of the TikHarm dataset, TikGuard's performance highlights its potential in enhancing content moderation, contributing to a safer online experience for minors. This study underscores the effectiveness of transformer models in video classification and sets a foundation for future research in this area.

accuracy, dataset, video, (12 more...)

arXiv.org Artificial Intelligence

2410.00403

Country:

Africa > Middle East > Egypt > Alexandria Governorate > Alexandria (0.05)
Asia > Japan (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Pig aggression classification using CNN, Transformers and Recurrent Networks

Souza, Junior Silva, Bedin, Eduardo, Higa, Gabriel Toshio Hirokawa, Loebens, Newton, Pistori, Hemerson

arXiv.org Artificial Intelligence, Mar-13-2024

Developing techniques to analyze and detect animal behavior is crucial for the livestock sector, because it makes it possible to monitor stress and animal welfare and supports decision making on the farm. Such applications can help breeders make decisions that improve production performance and reduce costs, since animal behavior is otherwise analyzed by humans, which is error-prone and time-consuming. Aggressiveness in pigs is an example of a behavior that is studied in order to reduce its impact through animal classification and identification. However, this process is laborious and susceptible to errors, which can be reduced through automation by visually classifying videos captured in a controlled environment. The captured videos can be used for training and, as a result, for classification through computer vision and artificial intelligence, employing neural network techniques. The main techniques used in this study are variants of transformers (STAM, TimeSformer, and ViViT) as well as techniques using convolutions, such as ResNet3D2, ResNet(2+1)D, and CnnLstm. These techniques were employed for pig video classification with the objective of identifying aggressive and non-aggressive behaviors. In this work, various techniques were compared to analyze the contribution of transformers, in addition to the effectiveness of convolutions, in video classification. Performance was evaluated using accuracy, precision, and recall. The TimeSformer technique showed the best results in video classification, with a median accuracy of 0.729.

classification, information, video, (15 more...)

arXiv.org Artificial Intelligence

2403.08528

Country:

South America > Brazil > Mato Grosso do Sul > Campo Grande (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.91)


Time Is MattEr: Temporal Self-supervision for Video Transformers

Yun, Sukmin, Kim, Jaehyung, Han, Dongyoon, Song, Hwanjun, Ha, Jung-Woo, Shin, Jinwoo

arXiv.org Artificial Intelligence, Jul-19-2022

Understanding temporal dynamics of video is an essential aspect of learning better video representations. Recently, transformer-based architectural designs have been extensively explored for video tasks due to their capability to capture long-term dependency of input sequences. However, we found that these Video Transformers are still biased to learn spatial dynamics rather than temporal ones, and debiasing the spurious correlation is critical for their performance. Based on the observations, we design simple yet effective self-supervised tasks for video models to learn temporal dynamics better. Specifically, for debiasing the spatial bias, our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs. Also, our method learns the temporal flow direction of video tokens among consecutive frames for enhancing the correlation toward temporal dynamics. Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
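One plausible way to express the "low-confidence outputs" constraint for shuffled frames is a cross-entropy between a uniform target and the model's predicted class distribution. The sketch below is an assumption made for illustration, not the paper's stated formulation; `uniform_target_loss` and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_target_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits).

    The value is smallest, log(num_classes), when the prediction is
    uniform, so minimizing it discourages confident outputs.
    """
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)

# Hypothetical class logits produced for a clip whose frames were shuffled.
confident_loss = uniform_target_loss([5.0, 0.0, 0.0, 0.0])
uncertain_loss = uniform_target_loss([0.0, 0.0, 0.0, 0.0])
```

Applied only to shuffled clips, a penalty of this kind rewards the model for being uncertain when temporal order has been destroyed, which discourages it from relying on spatial cues alone.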

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2207.09067

Country:

Asia > South Korea (0.04)
North America > United States > Maryland > Baltimore (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


Facebook AI Introduces TimeSformer: A New Video Architecture Based Purely On Transformers

#artificialintelligence, Mar-17-2021, 12:53:00 GMT

Facebook AI has built a new architecture for video understanding called TimeSformer. The video architecture is based purely on Transformers, which have become the dominant approach for many natural language processing (NLP) applications such as machine translation and general language understanding. TimeSformer achieves the best reported numbers on multiple challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Compared with modern 3D convolutional neural networks, it is nearly three times faster to train and requires less than one-tenth of the compute for inference.

timesformer, transformer, video, (8 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.57)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.39)
