Collaborating Authors: Chang, Chih-Yung


Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT

arXiv.org Artificial Intelligence

Recently, violence detection systems developed using unified multimodal models have achieved significant success and attracted widespread attention. However, most of these systems face two critical challenges: a lack of interpretability, as they are black-box models, and limited functionality, offering only classification or retrieval capabilities. To address these challenges, this paper proposes a novel interpretable violence detection system, termed the Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and graph attention networks (GAT) to provide three core functionalities: detection, retrieval, and explanation. Specifically, the system processes each video frame along with text descriptions generated by a large language model (LLM) for videos containing potential violent behavior. It employs ImageBind to generate high-dimensional embeddings for constructing a knowledge graph, uses GAT for reasoning, and applies lightweight time-series modules to extract video embedding features. The final step connects a classifier and a retriever to produce multi-functional outputs. The interpretability of the KG enables the system to verify the reasoning process behind each output. Additionally, the paper introduces several lightweight methods to reduce the resource consumption of the TIO system and improve its efficiency. Extensive experiments conducted on the XD-Violence and UCF-Crime datasets validate the effectiveness of the proposed system. A case study further reveals an intriguing phenomenon: as the number of bystanders increases, the occurrence of violent behavior tends to decrease.
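The abstract outlines a pipeline in which ImageBind embeddings populate a knowledge graph, a GAT reasons over it, and a lightweight temporal module summarizes frame features before classification and retrieval heads. The following is a minimal PyTorch sketch of that data flow, not the authors' implementation; the module names, dimensions, and the choice of a GRU as the temporal module are illustrative assumptions.

```python
# Minimal sketch of a TIO-style pipeline (illustrative, not the released code):
# a single graph-attention layer over knowledge-graph node embeddings
# (e.g., ImageBind outputs), a lightweight GRU over per-frame features,
# and heads for classification (detection) and retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """One attention head over a dense adjacency matrix (Velickovic-style GAT)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: [N, in_dim] node embeddings; adj: [N, N] 0/1 matrix with self-loops.
        h = self.W(x)
        n = h.size(0)
        h_i = h.unsqueeze(1).expand(n, n, -1)
        h_j = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)      # attention over each node's neighbors
        return F.elu(alpha @ h)               # [N, out_dim]

class TIOSketch(nn.Module):
    def __init__(self, emb_dim=1024, hid_dim=256, num_classes=2):
        super().__init__()
        self.gat = SimpleGATLayer(emb_dim, hid_dim)
        self.temporal = nn.GRU(emb_dim, hid_dim, batch_first=True)  # lightweight time-series module
        self.classifier = nn.Linear(2 * hid_dim, num_classes)       # detection head
        self.retriever = nn.Linear(2 * hid_dim, hid_dim)            # embedding head for retrieval

    def forward(self, node_emb, adj, frame_emb):
        # node_emb: [N, emb_dim] KG nodes; frame_emb: [1, T, emb_dim] frame sequence.
        graph_repr = self.gat(node_emb, adj).mean(dim=0)
        _, last_state = self.temporal(frame_emb)
        fused = torch.cat([graph_repr, last_state[-1, 0]], dim=-1)
        return self.classifier(fused), F.normalize(self.retriever(fused), dim=-1)

# Example usage with random inputs (4 KG nodes, 16 frames):
model = TIOSketch()
nodes, frames = torch.randn(4, 1024), torch.randn(1, 16, 1024)
adj = torch.ones(4, 4)
logits, query_emb = model(nodes, adj, frames)
```

A full version would need multi-head attention, relation-typed edges, and the retrieval index described in the paper; this sketch only mirrors the overall flow from graph reasoning and temporal summarization to the two output heads.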


Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

arXiv.org Artificial Intelligence

Weakly Supervised Monitoring Anomaly Detection (WSMAD) uses weakly supervised learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and feeds them into a time-series analysis module, which acts as the teacher model. Its insights are then transferred via knowledge distillation to a simplified convolutional network (the student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
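As a concrete illustration of the stage-one teacher-student transfer described above, the snippet below sketches a standard temperature-scaled distillation loss in PyTorch. The temperature T, mixing weight alpha, and the function name are illustrative assumptions rather than TCVADS's published settings.

```python
# Stage-one distillation sketch (assumed form, not the released TCVADS code):
# soft teacher targets guide a small convolutional student on weak binary labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: cross-entropy against the weak normal/anomalous labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example: a batch of 8 clips, 2 classes (normal vs. anomalous).
loss = distillation_loss(torch.randn(8, 2), torch.randn(8, 2), torch.randint(0, 2, (8,)))
```

The weighting between the soft and hard terms trades off how closely the student imitates the teacher's temporal reasoning versus how directly it fits the weak labels, which is the usual lever for keeping the student small enough for edge deployment.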


Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability

arXiv.org Artificial Intelligence

Thanks to the rapid advancement of computer hardware, deep learning has made significant progress in applications involving unstructured data, such as images (Cao & Chen, 2025) and text (Li et al., 2024). In particular, the success of representation learning (Wang & Lian, 2025; Zhang et al., 2025) has gradually displaced earlier approaches that transformed unstructured data into structured formats. The key to the success of representation learning lies in leveraging a large number of parameters trained via backpropagation, enabling the model to adapt to data with non-normal distributions. Although models based on backpropagation neural networks (Yang et al., 2019; Banerjee et al., 2023) have achieved significant technical advances, their application in many sensitive domains, such as medicine (Zhang et al., 2025) and industrial inspection (Rathee et al., 2021), still faces considerable challenges because the basis of their decision-making is difficult to understand. Explainable Artificial Intelligence (XAI) aims to reveal the inner mechanisms of neural network decisions, thereby making these models more reliable for applications in sensitive domains. In recent years, several studies (Li et al., 2025; Jing et al., 2025; Liu et al., 2024; Guan et al., 2024) have focused on injecting explainability into deep learning models and using various visualization techniques to explain the decisions of these "black box" models. While these models have achieved a certain level of interpretability, two pressing issues remain (Huang & Marques, 2023; Huang & Marques, 2024): first, whether the correlations between different attributes are correctly evaluated, and second, whether the model's decision-making pathway truly aligns with human reasoning, even when the model's understanding appears consistent with user expectations.


From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System

arXiv.org Artificial Intelligence

Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of these models are black-box systems that face challenges regarding explainability during training and inference. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule-based Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One branch, the implicit branch, uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch, the explicit branch, utilizes language-image alignment to perform fine-grained classification. For the language channel of the explicit branch, RuleVM uses the state-of-the-art YOLO-World model to detect objects in video frames, and association rules are identified through data mining methods as descriptions of the video (see the sketch below). Leveraging the dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieves the best performance in both coarse-grained and fine-grained monitoring, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
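To make the explicit branch's rule-mining step more concrete, here is a small self-contained sketch that derives pairwise association rules from per-frame object labels (such as YOLO-World detections). It illustrates the general technique, not RuleVM's actual mining pipeline; the thresholds and the restriction to single-item antecedents are simplifying assumptions.

```python
# Illustrative pairwise association-rule mining over detected object labels
# (not RuleVM's actual pipeline; thresholds and rule shape are assumptions).
from collections import Counter
from itertools import combinations

def mine_pair_rules(frame_labels, min_support=0.1, min_confidence=0.6):
    """frame_labels: list of sets, each holding the object names detected in one frame."""
    n = len(frame_labels)
    item_count, pair_count = Counter(), Counter()
    for labels in frame_labels:
        item_count.update(labels)                          # per-item frequency
        pair_count.update(combinations(sorted(labels), 2)) # co-occurrence frequency

    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        # Consider both directions: {a} -> {b} and {b} -> {a}.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = c / item_count[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules  # (antecedent, consequent, support, confidence)

# Example with toy detections from three frames:
frames = [{"person", "knife"}, {"person", "knife", "crowd"}, {"person", "crowd"}]
print(mine_pair_rules(frames))
```

Rules mined this way can be rendered as textual descriptions of a video and aligned with frames through the language-image channel, which is where the interpretability of the explicit branch comes from.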