Networking Systems for Video Anomaly Detection: A Tutorial and Survey

Jing Liu, Yang Liu, Jieyu Lin, Jielin Li, Peng Sun, Bo Hu, Liang Song, Azzedine Boukerche, Victor C. M. Leung

arXiv.org Artificial Intelligence 

With the widespread use of surveillance cameras in smart cities [104] and the boom of online video applications powered by 4G/5G communication technologies, traditional human inspection can no longer accurately monitor the video data generated around the clock. Manual review is not only time-consuming and labor-intensive but also risks leaking sensitive information (e.g., biometrics and sensitive speech). In contrast, Internet of Video Things (IoVT) applications empowered by Video Anomaly Detection (VAD) [54], such as Intelligent Video Surveillance Systems (IVSS) and automated content analysis platforms, can process massive video streams online and detect events of interest in real time. By sending only the noteworthy anomalous segments for human review, they significantly reduce data storage and communication costs and help ease public concerns about data security and privacy protection. As a result, VAD has gained widespread attention in academia and industry over the last decade and has been applied in emerging fields such as information forensics [154] and industrial manufacturing [71] in smart cities, as well as online content analysis in mobile video applications [153]. VAD extends the data scope of conventional Anomaly Detection (AD) from time series, images, and graphs to video, so it must cope not only with the endogenous complexity of the data but also with the computational and communication costs on resource-limited devices [55]. Specifically, the inherent high-dimensional structure of video data, its high information density and redundancy, the heterogeneity of temporal and spatial patterns, and the feature entanglement between foreground targets and background scenes make VAD more challenging than traditional AD tasks at the levels of representation learning and anomaly discrimination [89].
Existing studies [4, 60, 69, 76] have shown that high-performance VAD models need to model both appearance and motion information, i.e., the difference between regular events and anomalous examples in the spatial and temporal dimensions. In contrast to time series AD, which mainly measures periodic temporal patterns of variables, and image AD, which focuses only on spatial contextual deviations, VAD needs to extract discriminative spatial and temporal features from a large amount of redundant information (e.g., repetitive temporal contexts and label-independent data distributions) and to learn the differences between normal and anomalous events in terms of their local appearances and global motions [100]. However, video anomalies are ambiguous and subjective [48].
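
To make the appearance/motion distinction concrete, the following is a minimal toy sketch, not a method taken from the surveyed works. It assumes a synthetic grayscale clip in which a small bright square moves across a dark background, and it uses two deliberately crude, hypothetical cues: per-frame mean brightness as the appearance (spatial) feature and the mean absolute difference between consecutive frames as the motion (temporal) feature. All function names (`make_clip`, `appearance_feature`, `motion_feature`, `anomaly_scores`) are illustrative assumptions.

```python
import numpy as np

def make_clip(step, n_frames=5, size=16):
    """Toy grayscale clip: a 2x2 bright square moving right by `step` pixels per frame."""
    clip = np.zeros((n_frames, size, size))
    for t in range(n_frames):
        x = t * step
        clip[t, 7:9, x:x + 2] = 1.0
    return clip

def appearance_feature(clip):
    """Spatial cue (assumed for illustration): mean brightness of each frame."""
    return clip.mean(axis=(1, 2))

def motion_feature(clip):
    """Temporal cue (assumed for illustration): mean absolute difference
    between consecutive frames."""
    return np.abs(np.diff(clip, axis=0)).mean(axis=(1, 2))

def anomaly_scores(clip, normal_clip):
    """Largest deviation of each cue from its average value on the normal clip."""
    app_ref = appearance_feature(normal_clip).mean()
    mot_ref = motion_feature(normal_clip).mean()
    app = np.abs(appearance_feature(clip) - app_ref).max()
    mot = np.abs(motion_feature(clip) - mot_ref).max()
    return float(app), float(mot)

normal = make_clip(step=1)  # regular event: the object moves slowly
fast = make_clip(step=3)    # same object, abnormally fast motion

app_score, mot_score = anomaly_scores(fast, normal)
print(f"appearance score: {app_score:.4f}, motion score: {mot_score:.4f}")
```

In this toy setting every frame of the fast clip looks exactly like a normal frame, so the appearance score is zero and only the motion score deviates. A purely frame-wise, image-AD-style detector would therefore miss the event, which is one simple way to see why VAD models are expected to capture both spatial and temporal patterns.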
