Wu, Nannan (Beihang University) | Chen, Feng (State University of New York, Albany) | Li, Jianxin (Beihang University) | Zhou, Baojian (State University of New York, Albany) | Ramakrishnan, Naren (Virginia Polytechnic Institute and State University)
Non-parametric graph scan (NPGS) statistics are used to detect anomalous connected subgraphs on graphs, and have a wide variety of applications, such as disease outbreak detection, road traffic congestion detection, and event detection in social media. In contrast to traditional parametric scan statistics (e.g., the Kulldorff statistic), NPGS statistics are free of distributional assumptions and can be applied to heterogeneous graph data. In this paper, we make a number of contributions to the computational study of NPGS statistics. First, we present a novel reformulation of the problem as a sequence of Budget Price-Collecting Steiner Tree (B-PCST) sub-problems. Second, we show that this reformulated problem is NP-hard for a large class of nonparametric statistic functions. Third, we further develop efficient exact and approximate algorithms for a special category of graphs in which the anomalous subgraphs can be reformulated in a fixed tree topology. Finally, using extensive experiments we demonstrate the performance of our proposed algorithms in two real-world application domains (water pollution detection in water sensor networks and spatial event detection in social media networks) and contrast against state-of-the-art connected subgraph detection methods.
Mapping of spatial hotspots, i.e., regions with significantly higher rates or probability density of generating certain events (e.g., disease or crime cases), is a important task in diverse societal domains, including public health, public safety, transportation, agriculture, environmental science, etc. Clustering techniques required by these domains differ from traditional clustering methods due to the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed explicitly to control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering have been extensively studied by the data mining and statistics communities. In this survey we present an up-to-date and detailed review of the models and algorithms developed by this field. We first present a general taxonomy of the clustering process with statistical rigor, covering key steps of data and statistical modeling, region enumeration and maximization, significance testing, and data update. We further discuss different paradigms and methods within each of key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone in generating new ideas and thoughts in this growing field and beyond.
This work views neural networks as data generating systems and applies anomalous pattern detection techniques on that data in order to detect when a network is processing an anomalous input. Detecting anomalies is a critical component for multiple machine learning problems including detecting adversarial noise. More broadly, this work is a step towards giving neural networks the ability to recognize an out-of-distribution sample. This is the first work to introduce "Subset Scanning" methods from the anomalous pattern detection domain to the task of detecting anomalous input of neural networks. Subset scanning treats the detection problem as a search for the most anomalous subset of node activations (i.e., highest scoring subset according to non-parametric scan statistics). Mathematical properties of these scoring functions allow the search to be completed in log-linear rather than exponential time while still guaranteeing the most anomalous subset of nodes in the network is identified for a given input. Quantitative results for detecting and characterizing adversarial noise are provided for CIFAR-10 images on a simple convolutional neural network. We observe an "interference" pattern where anomalous activations in shallow layers suppress the activation structure of the original image in deeper layers.
Identifying anomalous patterns in real-world data is essential for understanding where, when, and how systems deviate from their expected dynamics. Yet methods that separately consider the anomalousness of each individual data point have low detection power for subtle, emerging irregularities. Additionally, recent detection techniques based on subset scanning make strong independence assumptions and suffer degraded performance in correlated data. We introduce methods for identifying anomalous patterns in non-iid data by combining Gaussian processes with novel log-likelihood ratio statistic and subset scanning techniques. Our approaches are powerful, interpretable, and can integrate information across multiple data streams. We illustrate their performance on numeric simulations and three open source spatiotemporal datasets of opioid overdose deaths, 311 calls, and storm reports.
Anomaly detection is a fundamental problem in dynamic networks. In this paper, we study an approach for identifying anomalous subgraphs based on the Heaviest Dynamic Subgraph (HDS) problem. The HDS in a time-evolving edge-weighted graph consists of a pair containing a subgraph and subinterval whose sum of edge weights is maximized. The HDS problem in a static graph is equivalent to the Prize Collecting Steiner Tree (PCST) problem with the Net-Worth objective---this is a very challenging problem, in general, and numerous heuristics have been proposed. Prior methods for the HDS problem use the PCST solution as a heuristic, and run in time quadratic in the size of the graph. As a result, they do not scale well to large instances. In this paper, we develop a new approach for the HDS problem, which combines rigorous algorithmic and practical techniques and has much better scalability. Our algorithm is able to extend to other variations of the HDS problem, such as the problem of finding multiple anomalous regions. We evaluate our algorithms in a diverse set of real and synthetic networks, and we find solutions with higher score and better detection power for anomalous events compared to earlier heuristics.