behavior recognition
PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild
Mueller, Felix B., Meier, Jan F., Lueddecke, Timo, Vogg, Richard, Freixanet, Roger L., Hassler, Valentin, Bosshard, Tiffany, Karakoc, Elif, O'Hearn, William J., Pereira, Sofia M., Sehner, Sandro, Wierucka, Kaja, Burkart, Judith, Fichtel, Claudia, Fischer, Julia, Gail, Alexander, Hobaiter, Catherine, Ostner, Julia, Samuni, Liran, Schülke, Oliver, Shahidi, Neda, Wessling, Erin G., Ecker, Alexander S.
Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We continue pretraining V-JEPA, a large-scale video model, on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets (ChimpACT, PanAf500, BaboonLand, and ChimpBehave), our approach consistently outperforms prior work, including fully fine-tuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.
A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition
Zhang, Xiuliang, Nyamasvisva, Tadiwa Elisha, Liu, Chuntao
Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle to model long-range dependencies. Conversely, Transformers excel at learning global contextual information but incur high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.
Artificial Behavior Intelligence: Technology, Challenges, and Future Directions
Jo, Kanghyun, Choi, Jehwan, Kim, Kwanho, Kim, Seongmin, Nguyen, Duy-Linh, Vo, Xuan-Thuy, Priadana, Adri, Tran, Tien-Dat
Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues. It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling. Furthermore, we highlight the transformative potential of recent advances in large-scale pretrained models, such as large language models (LLMs), vision foundation models, and multimodal integration models, in significantly improving the accuracy and interpretability of behavior recognition. Our research team has a strong interest in the ABI domain and is actively conducting research, particularly focusing on the development of intelligent lightweight models capable of efficiently inferring complex human behaviors. This paper identifies several technical challenges that must be addressed to deploy ABI in real-world applications, including learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. To tackle these challenges, our team is exploring various optimization strategies including lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation, while validating their applicability in real-time environments. The philosopher Aristotle once described human beings as "social animals." This statement implies that humans do not exist as isolated entities, but rather live in constant interaction and communication with others.
Humans intuitively perceive others' emotions, states, and intentions through their tone of voice, facial expressions, gestures, and behavioral patterns. These abilities are fundamental to mutual understanding and empathetic social interaction.
Driving behavior recognition via self-discovery learning
Autonomous driving systems require a deep understanding of human driving behaviors to achieve higher intelligence and safety. Despite advancements in deep learning, challenges such as long-tail distribution due to scarce samples and confusion from similar behaviors hinder effective driving behavior detection. Existing methods often fail to address sample confusion adequately, as datasets frequently contain ambiguous samples that obscure unique semantic information.
Pig behavior dataset and Spatial-temporal perception and enhancement networks based on the attention mechanism for pig behavior recognition
Qi, Fangzheng, Hou, Zhenjie, Lin, En, Li, Xing, Liang, iuzhen, Zhou, Xinwen
The recognition of pig behavior plays a crucial role in smart farming and welfare assurance for pigs. Currently, in the field of pig behavior recognition, the lack of publicly available behavioral datasets not only limits the development of innovative algorithms but also hampers model robustness and algorithm optimization. This paper proposes a dataset containing 13 pig behaviors that significantly impact welfare. Based on this dataset, this paper proposes a spatial-temporal perception and enhancement network based on the attention mechanism to model the spatiotemporal features of pig behaviors and their associated interaction areas in video data. The network is composed of a spatiotemporal perception network and a spatiotemporal feature enhancement network. The spatiotemporal perception network is responsible for establishing connections between the pigs and the key regions of their behaviors in the video data. The spatiotemporal feature enhancement network further strengthens the important spatial features of individual pigs and captures the long-term dependencies of the spatiotemporal features of individual behaviors by remodeling these connections, thereby enhancing the model's perception of spatiotemporal changes in pig behaviors. Experimental results demonstrate that on the dataset established in this paper, our proposed model achieves a MAP score of 75.92%, which is an 8.17% improvement over the best-performing traditional model. This study not only improves the accuracy and generalizability of individual pig behavior recognition but also provides new technological tools for modern smart farming. The dataset and related code will be made publicly available alongside this paper.
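For readers unfamiliar with the MAP metric reported above, the sketch below shows one common way to compute mean average precision: non-interpolated average precision per behavior class, averaged over classes. The abstract does not say which variant the authors use, so this is an assumption; the class names, scores, and labels are made up for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k taken at each positive's rank."""
    # Rank samples from highest to lowest confidence score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """Average the per-class AP values; per_class maps name -> (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical toy predictions for two behavior classes (not from the paper).
toy_predictions = {
    "eating": ([0.9, 0.8, 0.3], [1, 0, 1]),
    "lying": ([0.7, 0.6, 0.2], [1, 1, 0]),
}

print(f"mAP = {mean_average_precision(toy_predictions):.4f}")
```

Detection-style benchmarks often use interpolated precision or IoU-based matching instead, so numbers from this sketch are not directly comparable to the reported 75.92%.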
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Deng, Shijian, Kosloski, Erin E., Patel, Siddhi, Barnett, Zeke A., Nan, Yiyang, Kaplan, Alexander, Aarukapalli, Sisira, Doan, William T., Wang, Matthew, Singh, Harsh, Rollins, Pamela R., Tian, Yapeng
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define audio-visual autism behavior recognition as the task of using audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.
Deep Neural Networks in Video Human Action Recognition: A Review
Wang, Zihan, Yang, Yang, Liu, Zhi, Zheng, Yifan
Video behavior recognition is currently one of the most foundational tasks in computer vision. The 2D neural networks of deep learning are built to recognize pixel-level information, such as images in RGB, RGB-D, or optical flow formats. With the increasingly wide use of surveillance video, more tasks relate to human action recognition, and a growing number of them require temporal information to analyze dependencies between frames. Researchers have therefore widely studied video-based recognition, rather than image-based (pixel-based) recognition only, to extract more informative elements from geometry tasks. This review covers multiple recently proposed research works and compares the advantages and disadvantages of the derived deep learning frameworks rather than machine learning frameworks. The comparison covers existing frameworks and datasets that consist of video-format data only. Due to the specific properties of human actions and the increasingly wide use of deep neural networks, we collected all research works published within the last three years, from 2020 to 2022. In our article, deep neural networks surpassed most other techniques in feature learning and extraction tasks, especially video action recognition.
Adaptive Multi-Agent Continuous Learning System
Qian, Xingyu, Yuemaier, Aximu, Liang, Longfei, Yang, Wen-Chi, Chen, Xiaogang, Li, Shunfen, Dai, Weibang, Song, Zhitang
We propose an adaptive, self-supervised multi-agent clustering recognition system based on a continuous learning mechanism for temporal sequences. The system uses several agents with different functions to build a connection structure that improves adaptability to diverse environmental demands; predicting each agent's input drives the agent to perform clustering recognition of sequences using a traditional algorithmic approach. Finally, feasibility experiments on video behavior clustering demonstrate that the system can cope with dynamic situations. Our work is available at https://github.com/qian-git/MAMMALS.
Machine Learning and Airport Security See Eye to Eye
The prospect of standing for hours on end has become all too common at airports around the world. But soon, airports may be piloting security programs based on behavior recognition and machine learning, instead of asking passengers to practice patience. As we know, patience is becoming a lost art, but predictive analytics based on sensor, device, and video data is a technology art form that airlines and airports are exploring. The 9/11 attacks and the 2001 Shoe Bomber's attempt are among the most well-known security threats, and they upended how we travel. To protect passengers and crews, airports have made finding dangerous items their primary objective.
Improvement of Multi-AUV Cooperation through Teammate Verification
Novitzky, Michael (The Georgia Institute of Technology)
Current methods for multi-AUV cooperation suffer in low-communication environments. State-of-the-art methods employ auctioneering or planning to determine a single AUV's task. These systems require communication to update models of teammates and tasks for efficient task selection. Most strategies assume a teammate is inoperable if a communication timeout is reached, which reduces overall team efficiency. Including teammate prediction has been shown to mitigate efficiency degeneration due to low communication. However, there is no verification of a predicted teammate's task other than through eventual communication. A possible verification tool is behavior recognition. Current behavior recognition utilizes either overhead sensors or post-mission analysis to track robot trajectories in order to infer their internal state. A system in which an AUV is capable of sensing a teammate, for example through a forward-looking sonar, and deducing its behavior along with contextual information, such as location, will enable an AUV to determine that teammate's current task in the overall mission. This will allow for an accurate update of that teammate's model, allowing the AUV to more efficiently determine its own next task rather than relying only on communication. This position paper posits that multi-AUV cooperation efficiency will improve in low-communication environments with the combination of robust teammate prediction along with verification using behavior recognition.
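The verification idea in the abstract above can be illustrated with a minimal sketch. It is not the author's method: the two behavior labels (`transit`, `loiter`), the path-straightness feature, the 0.8 threshold, and all function names are assumptions chosen for illustration, and the track stands in for teammate positions that an AUV might estimate from a forward-looking sonar.

```python
import math


def path_straightness(track):
    """Net displacement divided by path length for a list of (x, y) points."""
    path_length = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    if path_length == 0.0:
        return 0.0  # A stationary or single-point track has no direction.
    return math.dist(track[0], track[-1]) / path_length


def recognize_behavior(track, straight_threshold=0.8):
    """Hypothetical two-class recognizer: straight tracks are 'transit'."""
    if path_straightness(track) >= straight_threshold:
        return "transit"
    return "loiter"


def verify_teammate(predicted_task, track):
    """Compare the predicted task with the observed behavior.

    Returns (verified, observed_behavior).
    """
    observed = recognize_behavior(track)
    return observed == predicted_task, observed


# Hypothetical observations: a straight run and one full circle.
straight_track = [(float(i), 0.0) for i in range(10)]
circle_track = [
    (math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
    for k in range(13)
]

print(verify_teammate("transit", straight_track))
print(verify_teammate("transit", circle_track))
```

When the observed behavior disagrees with the prediction, the observing AUV could update its teammate model from the observation instead of waiting for the next message. A fuller recognizer would also use the contextual information the abstract mentions, such as location.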