surveillance video
Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
W e introduce T ext-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for W eakly Supervised Video Anomaly Detection (WSVAD) that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a Vision-Language Model (VLM), (2) constructing structured knowledge by organizing the captions into four semantic slots--action, object, context, and environment, and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. W e evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning
Naeen, Mohammad Ali Etemadi, Mohammadzade, Hoda, Shouraki, Saeed Bagheri
Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World - an open-vocabulary vision-language detector - to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics - including confusion matrices, ROC curves, and macro/weighted averages - demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.
- Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (2 more...)
Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling
Khan, Md. Rashid Shahriar, Hasan, Md. Abrar, Justice, Mohammod Tareq Aziz
Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- Africa (0.04)
- Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Video Forgery Detection for Surveillance Cameras: A Review
Tayfor, Noor B., Rashid, Tarik A., Qader, Shko M., Hassan, Bryar A., Abdalla, Mohammed H., Majidpour, Jafar, Ahmed, Aram M., Ali, Hussein M., Aladdin, Aso M., Abdullah, Abdulhady A., Shamsaldin, Ahmed S., Sidqi, Haval M., Salih, Abdulrahman, Yaseen, Zaher M., Ameen, Azad A., Nayak, Janmenjoy, Hamza, Mahmood Yashar
The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.
- Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Asia > Middle East > Iraq > Kurdistan Region > Sulaymaniyah Governorate > Sulaymaniyah (0.04)
- (6 more...)
- Research Report (1.00)
- Overview (1.00)
- Media (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Commercial Services & Supplies > Security & Alarm Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications > Social Media (1.00)
- (6 more...)
Few-shot Semantic Encoding and Decoding for Video Surveillance
Cheng, Baoping, Zhang, Yukun, Wang, Liming, Xie, Xiaoyan, Fu, Tao, Wang, Dongkun, Tao, Xiaoming
With the continuous increase in the number and resolution of video surveillance cameras, the burden of transmitting and storing surveillance video is growing. Traditional communication methods based on Shannon's theory are facing optimization bottlenecks. Semantic communication, as an emerging communication method, is expected to break through this bottleneck and reduce the storage and transmission consumption of video. Existing semantic decoding methods often require many samples to train the neural network for each scene, which is time-consuming and labor-intensive. In this study, a semantic encoding and decoding method for surveillance video is proposed. First, the sketch was extracted as semantic information, and a sketch compression method was proposed to reduce the bit rate of semantic information. Then, an image translation network was proposed to translate the sketch into a video frame with a reference frame. Finally, a few-shot sketch decoding network was proposed to reconstruct video from sketch. Experimental results showed that the proposed method achieved significantly better video reconstruction performance than baseline methods. The sketch compression method could effectively reduce the storage and transmission consumption of semantic information with little compromise on video quality. The proposed method provides a novel semantic encoding and decoding method that only needs a few training samples for each surveillance scene, thus improving the practicality of the semantic communication system.
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (0.97)
- Information Technology > Communications > Networks (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework
Li, Xiaolong, Wei, Jianhao, Wang, Haidong, Dong, Li, Chen, Ruoyang, Yi, Changyan, Cai, Jun, Niyato, Dusit, Xuemin, null, Shen, null
In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in-the-loop is crucial for developing realistic and safe traffic management solutions. However, there is falls short of simulating complex real-world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. Specifically, SVFDT builds comprehensive pedestrian-vehicle interaction models by leveraging multi-source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in realtime. We analyze key design requirements and challenges and present core guidelines for SVFDT's system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal-server frameworks highlight SV-FDT's advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.
- North America > Canada (0.47)
- Asia > China (0.14)
- North America > United States (0.14)
- Transportation > Ground > Road (1.00)
- Information Technology > Security & Privacy (1.00)
- Transportation > Infrastructure & Services (0.90)
Deep Learning based approach to detect Customer Age, Gender and Expression in Surveillance Video
Ijjina, Earnest Paul, Kanahasabai, Goutham, Joshi, Aniruddha Srinivas
In the current information era, customer analytics play a key role in the success of any business. Since customer demographics primarily dictate their preferences, identification and utilization of age & gender information of customers in sales forecasting, may maximize retail sales. In this work, we propose a computer vision based approach to age and gender prediction in surveillance video. The proposed approach leverage the effectiveness of Wide Residual Networks and Xception deep learning models to predict age and gender demographics of the consumers. The proposed approach is designed to work with raw video captured in a typical CCTV video surveillance system. The effectiveness of the proposed approach is evaluated on real-life garment store surveillance video, which is captured by low resolution camera, under non-uniform illumination, with occlusions due to crowding, and environmental noise. The system can also detect customer facial expressions during purchase in addition to demographics, that can be utilized to devise effective marketing strategies for their customer base, to maximize sales.
- Retail (1.00)
- Commercial Services & Supplies > Security & Alarm Services (0.35)
Customer Analytics using Surveillance Video
Ijjina, Earnest Paul, Joshi, Aniruddha Srinivas, Kanahasabai, Goutham, P, Keerthi Priyanka
The analysis of sales information, is a vital step in designing an effective marketing strategy. This work proposes a novel approach to analyse the shopping behaviour of customers to identify their purchase patterns. An extended version of the Multi-Cluster Overlapping k-Means Extension (MCOKE) algorithm with weighted k-Means algorithm is utilized to map customers to the garments of interest. The age & gender traits of the customer; the time spent and the expressions exhibited while selecting garments for purchase, are utilized to associate a customer or a group of customers to a garments they are interested in. Such study on the customer base of a retail business, may help in inferring the products of interest of their consumers, and enable them in developing effective business strategies, thus ensuring customer satisfaction, loyalty, increased sales and profits.
- Retail (1.00)
- Information Technology (0.69)
- Commercial Services & Supplies > Security & Alarm Services (0.48)
Receiver-Centric Generative Semantic Communications
Liu, Xunze, Sun, Yifei, Wang, Zhaorui, You, Lizhao, Pan, Haoyuan, Wang, Fangxin, Cui, Shuguang
This paper investigates semantic communications between a transmitter and a receiver, where original data, such as videos of interest to the receiver, is stored at the transmitter. Although significant process has been made in semantic communications, a fundamental design problem is that the semantic information is extracted based on certain criteria at the transmitter alone, without considering the receiver's specific information needs. As a result, critical information of primary concern to the receiver may be lost. In such cases, the semantic transmission becomes meaningless to the receiver, as all received information is irrelevant to its interests. To solve this problem, this paper presents a receiver-centric generative semantic communication system, where each transmission is initialized by the receiver. Specifically, the receiver first sends its request for the desired semantic information to the transmitter at the start of each transmission. Then, the transmitter extracts the required semantic information accordingly. A key challenge is how the transmitter understands the receiver's requests for semantic information and extracts the required semantic information in a reasonable and robust manner. We address this challenge by designing a well-structured framework and leveraging off-the-shelf generative AI products, such as GPT-4, along with several specialized tools for detection and estimation. Evaluation results demonstrate the feasibility and effectiveness of the proposed new semantic communication system.
Weakly-Supervised Anomaly Detection in Surveillance Videos Based on Two-Stream I3D Convolution Network
Nejad, Sareh Soltani, Haque, Anwar
The widespread implementation of urban surveillance systems has necessitated more sophisticated techniques for anomaly detection to ensure enhanced public safety. This paper presents a significant advancement in the field of anomaly detection through the application of Two-Stream Inflated 3D (I3D) Convolutional Networks. These networks substantially outperform traditional 3D Convolutional Networks (C3D) by more effectively extracting spatial and temporal features from surveillance videos, thus improving the precision of anomaly detection. Our research advances the field by implementing a weakly supervised learning framework based on Multiple Instance Learning (MIL), which uniquely conceptualizes surveillance videos as collections of 'bags' that contain instances (video clips). Each instance is innovatively processed through a ranking mechanism that prioritizes clips based on their potential to display anomalies. This novel strategy not only enhances the accuracy and precision of anomaly detection but also significantly diminishes the dependency on extensive manual annotations. Moreover, through meticulous optimization of model settings, including the choice of optimizer, our approach not only establishes new benchmarks in the performance of anomaly detection systems but also offers a scalable and efficient solution for real-world surveillance applications. This paper contributes significantly to the field of computer vision by delivering a more adaptable, efficient, and context-aware anomaly detection system, which is poised to redefine practices in urban surveillance.
- North America > Canada > Ontario > Middlesex County > London (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)