

Neighborhood Structure Assisted Non-negative Matrix Factorization and its Application in Unsupervised Point Anomaly Detection Machine Learning

Dimensionality reduction is considered an important step for ensuring competitive performance in unsupervised learning tasks such as anomaly detection. Non-negative matrix factorization (NMF) is a popular and widely used method for accomplishing this goal. But NMF, together with its recent enhanced versions such as graph regularized NMF and symmetric NMF, has no provision for including neighborhood structure information and, as a result, may fail to provide satisfactory performance in the presence of nonlinear manifold structure. To address that shortcoming, we propose to incorporate neighborhood structural similarity information within the NMF framework by modeling the data through a minimum spanning tree. What motivates our choice is the understanding that, in the presence of complicated data structure, a minimum spanning tree can approximate the intrinsic distance between two data points better than a simple Euclidean distance does, and consequently it constitutes a more reasonable basis for differentiating anomalies from the normal class data. We label the resulting method the neighborhood structure assisted NMF. By comparing the formulation and properties of the neighborhood structure assisted NMF with other versions of NMF, including graph regularized NMF and symmetric NMF, it is apparent that the inclusion of the neighborhood structure information using a minimum spanning tree makes a key difference. We further devise both offline and online algorithmic versions of the proposed method. Empirical comparisons using twenty benchmark datasets as well as an industrial dataset extracted from a hydropower plant demonstrate the superiority of the neighborhood structure assisted NMF and support our claim of merit.
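The claim that a minimum spanning tree approximates intrinsic distance better than Euclidean distance can be illustrated with a small sketch. It is not the authors' algorithm and involves no NMF step: it only builds a minimum spanning tree over a handful of points with Prim's algorithm and compares the straight-line distance with the path length along the tree. The function names and the U-shaped toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, weight) edges over point indices.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest edge weight linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def tree_distance(edges, source, target):
    """Length of the unique path between two nodes of the tree."""
    adjacency = defaultdict(list)
    for i, j, w in edges:
        adjacency[i].append((j, w))
        adjacency[j].append((i, w))
    stack = [(source, -1, 0.0)]
    while stack:
        node, previous, dist = stack.pop()
        if node == target:
            return dist
        for neighbour, w in adjacency[node]:
            if neighbour != previous:
                stack.append((neighbour, node, dist + w))
    raise ValueError("target is not connected to source")


# Hypothetical toy data: seven points laid out along a U-shaped curve.
points = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
edges = minimum_spanning_tree(points)
print(math.dist(points[0], points[-1]))           # 2.0 (straight line)
print(tree_distance(edges, 0, len(points) - 1))   # 6.0 (along the curve)
```

The two ends of the U are close in Euclidean terms but far apart along the curve, which is the kind of nonlinear structure the abstract refers to.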

A general anomaly detection framework for fleet-based condition monitoring of machines Machine Learning

Machine failures decrease up-time and can lead to extra repair costs or even to human casualties and environmental pollution. Recent condition monitoring techniques use artificial intelligence in an effort to avoid time-consuming manual analysis and handcrafted feature extraction. Many of these only analyze a single machine and require a large historical data set, which in practice can be difficult and expensive to collect. However, some industrial condition monitoring applications involve a fleet of similarly operating machines. In most of these applications, it is safe to assume healthy conditions for the majority of machines. Deviating machine behavior is then an indicator of a machine fault. This work proposes an unsupervised, generic anomaly detection framework for fleet-based condition monitoring. It uses generic building blocks and offers three key advantages. First, a historical data set is not required due to online fleet-based comparisons. Second, it allows incorporating domain expertise through user-defined comparison measures. Finally, contrary to most black-box artificial intelligence techniques, easy interpretability allows a domain expert to validate the predictions made by the framework. Two use-cases on an electrical machine fleet demonstrate the applicability of the framework to detect a voltage unbalance by means of electrical and vibration signatures.
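The fleet-based idea can be sketched as follows, assuming each machine reports a short signal and that most machines are healthy. This is not the framework from the paper: the scoring rule (median dissimilarity to the rest of the fleet), the flagging threshold, and all names are assumptions made for illustration. The comparison measure is passed in as a function, mirroring the abstract's point about user-defined measures.

```python
import statistics


def mean_abs_difference(a, b):
    """One possible comparison measure; a domain expert could supply another."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def fleet_scores(signals, compare):
    """Score each machine by its median dissimilarity to all other machines."""
    scores = {}
    for name, signal in signals.items():
        others = [
            compare(signal, other)
            for other_name, other in signals.items()
            if other_name != name
        ]
        scores[name] = statistics.median(others)
    return scores


def flag_deviating(scores, factor=3.0):
    """Flag machines whose score exceeds `factor` times the fleet median score."""
    baseline = statistics.median(scores.values())
    return sorted(name for name, s in scores.items() if s > factor * baseline)


# Hypothetical fleet: three machines behave alike, one deviates.
signals = {
    "m1": [1.0, 1.0, 1.0],
    "m2": [1.1, 1.0, 0.9],
    "m3": [0.9, 1.0, 1.1],
    "m4": [3.0, 3.0, 3.0],
}
scores = fleet_scores(signals, mean_abs_difference)
print(flag_deviating(scores))  # ['m4']
```

No historical data is needed here because each machine is judged only against its peers at the same moment, and the per-machine scores remain easy to inspect.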

Real-time Anomaly Detection and Classification in Streaming PMU Data Machine Learning

Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fast-sampled operational data that should enable quicker decision-making. This work presents a general interpretable framework for analyzing real-time PMU data, thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools to the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define, in a principled way, an efficient anomaly detection tool. Finally, the last module of our framework produces on-the-fly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA. Traditional supervisory control and data acquisition (SCADA) systems provide the operator with information regarding the system state on the order of seconds. However, such fidelity, considered appropriate in prior decades, is not sufficient to observe or predict disturbances at the faster timescales that are increasingly being observed in today's stochastic grid [1]. To provide more rapid measurement data, phasor measurement units (PMUs) have gained widespread deployment. PMUs [2] are time-synchronized by GPS timestamps and collect measurements of system states (Eg.

Multi-Stage Fault Warning for Large Electric Grids Using Anomaly Detection and Machine Learning Machine Learning

In the monitoring of a complex electric grid, it is of paramount importance to provide operators with early warnings of anomalies detected on the network, along with a precise classification and diagnosis of the specific fault type. In this paper, we propose a novel multi-stage early warning system prototype for electric grid fault detection, classification, subgroup discovery, and visualization. In the first stage, a computationally efficient anomaly detection method based on quartiles detects the presence of a fault in real time. In the second stage, the fault is classified into one of nine pre-defined disaster scenarios. The time series data are first mapped to highly discriminative features by applying dimensionality reduction based on temporal autocorrelation. The features are then passed to one of three classification techniques: support vector machine, random forest, and artificial neural network. Finally, in the third stage, intra-class clustering based on dynamic time warping is used to characterize the fault with further granularity. Results on the Bonneville Power Administration electric grid data show that i) the proposed anomaly detector is both fast and accurate; ii) dimensionality reduction leads to dramatic improvement in classification accuracy and speed; iii) the random forest method offers the most accurate, consistent, and robust fault classification; and iv) time series within a given class naturally separate into five distinct clusters which correspond closely to the geographical distribution of electric grid buses.
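The abstract does not spell out the quartile-based detector used in the first stage. As a rough sketch of what such a detector can look like, the snippet below applies the standard interquartile-range (Tukey) rule: bounds are computed once from reference data, and each new sample is checked against them. The multiplier k = 1.5 and the function names are assumptions, not details taken from the paper.

```python
import statistics


def iqr_bounds(reference, k=1.5):
    """Lower and upper fences from the quartiles of reference (fault-free) data."""
    q1, _, q3 = statistics.quantiles(reference, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomalous(value, bounds):
    """True if a new sample falls outside the fences."""
    low, high = bounds
    return value < low or value > high


# Hypothetical reference window: the integers 1..9.
bounds = iqr_bounds(range(1, 10))
print(bounds)                     # (-3.0, 13.0)
print(is_anomalous(5, bounds))    # False
print(is_anomalous(20, bounds))   # True
```

Once the fences are known, checking a sample costs two comparisons, which is consistent with the emphasis on computational efficiency for real-time use.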

Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection Machine Learning

Machine learning has become an important component of many systems and applications, including computer vision, spam filtering, and malware and network intrusion detection, among others. Despite the capabilities of machine learning algorithms to extract valuable information from data and produce accurate predictions, it has been shown that these algorithms are vulnerable to attacks. Data poisoning is one of the most relevant security threats against machine learning systems, where attackers can subvert the learning process by injecting malicious samples into the training data. Recent work in adversarial machine learning has shown that the so-called optimal attack strategies can successfully poison linear classifiers, degrading the performance of the system dramatically after compromising a small fraction of the training dataset. In this paper we propose a defence mechanism to mitigate the effect of these optimal poisoning attacks based on outlier detection. We show empirically that the adversarial examples generated by these attack strategies are quite different from genuine points, as no detectability constraints are considered to craft the attack. Hence, they can be detected with an appropriate pre-filtering of the training dataset.
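Pre-filtering a training set by outlier detection can be sketched as below. The abstract does not say which outlier detector the authors use, so this example substitutes a simple distance-based rule: within each class, a point is dropped if its distance to the coordinate-wise median of that class exceeds a multiple of the class's median distance. The rule, the factor of 3, and the function name are assumptions for illustration only.

```python
import math
import statistics
from collections import defaultdict


def filter_training_set(samples, labels, factor=3.0):
    """Return the sorted indices of samples kept after per-class outlier removal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    keep = []
    for indices in by_class.values():
        dim = len(samples[indices[0]])
        # Coordinate-wise median: a centre that a few extreme points cannot drag far.
        centre = [statistics.median(samples[i][d] for i in indices) for d in range(dim)]
        dists = {i: math.dist(samples[i], centre) for i in indices}
        threshold = factor * statistics.median(dists.values())
        keep.extend(i for i in indices if dists[i] <= threshold)
    return sorted(keep)


# Hypothetical data: the point (10, 10) is labelled class 0 but lies far from it.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10),
           (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(filter_training_set(samples, labels))  # [0, 1, 2, 3, 5, 6, 7, 8]
```

The classifier would then be trained only on the kept indices. A rule this simple works in the toy case because the injected point sits far from the genuine ones, which is the property the abstract reports for the attacks it studies.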

How to ask questions data science can solve. – Towards Data Science – Medium


My students frequently have trouble finding good data science questions. Usually, this is because they've yet to figure out how questions map to data solutions. I've found it insightful to use Bloom's Taxonomy with data technologies to draw a clearer picture. Data science tools may seem very limited at first, but we can rephrase most real-world questions into the language of our tools. Bloom's Taxonomy categorizes learning objectives that educators use to lead their students.