MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks Machine Learning

The prevalence of networked sensors and actuators in many real-world systems such as smart buildings, factories, power plants, and data centers generate substantial amounts of multivariate time series data for these systems. The rich sensor data can be continuously monitored for intrusion events through anomaly detection. However, conventional threshold-based anomaly detection methods are inadequate due to the dynamic complexities of these systems, while supervised machine learning methods are unable to exploit the large amounts of data due to the lack of labeled data. On the other hand, current unsupervised machine learning approaches have not fully exploited the spatial-temporal correlation and other dependencies amongst the multiple variables (sensors/actuators) in the system for detecting anomalies. In this work, we propose an unsupervised multivariate anomaly detection method based on Generative Adversarial Networks (GANs). Instead of treating each data stream independently, our proposed MAD-GAN framework considers the entire variable set concurrently to capture the latent interactions amongst the variables. We also fully exploit both the generator and discriminator produced by the GAN, using a novel anomaly score called DR-score to detect anomalies by discrimination and reconstruction. We have tested our proposed MAD-GAN using two recent datasets collected from real-world CPS: the Secure Water Treatment (SWaT) and the Water Distribution (WADI) datasets. Our experimental results showed that the proposed MAD-GAN is effective in reporting anomalies caused by various cyber-intrusions compared in these complex real-world systems.

Anomaly Detection Techniques in Focus: Multivariate and Univariate Anodot


In our previous post, we explained what time series data is and provided some details as to how the Anodot time series anomaly detection system is able to spot anomalies in time series data. We also discussed the importance of choosing a model for a metric's normal behavior which included any and all seasonal patterns in the metric, and the specific algorithm which Anodot uses to find seasonal patterns. At the end of that post we said it's possible to get a sense of the bigger picture from a lot of individual anomalies. Conciseness is a requirement of any large-scale anomaly detection system because monitoring millions of metrics is guaranteed to generate a flood of reported anomalies, even if there are zero false positives. Achieving conciseness in this context is analogous to distilling the many individual symptoms into a single diagnosis, in much the same way that a mechanic might diagnose a car problem by observing the pitch, volume, and duration of all the sounds it makes, in addition to watching all the dials and indicator lights on the dashboard.

Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss Machine Learning

We address the problem of detecting changes in multivariate datastreams, and we investigate the intrinsic difficulty that change-detection methods have to face when the data dimension scales. In particular, we consider a general approach where changes are detected by comparing the distribution of the log-likelihood of the datastream over different time windows. Despite the fact that this approach constitutes the frame of several change-detection methods, its effectiveness when data dimension scales has never been investigated, which is indeed the goal of our paper. We show that the magnitude of the change can be naturally measured by the symmetric Kullback-Leibler divergence between the pre- and post-change distributions, and that the detectability of a change of a given magnitude worsens when the data dimension increases. This problem, which we refer to as \emph{detectability loss}, is due to the linear relationship between the variance of the log-likelihood and the data dimension. We analytically derive the detectability loss on Gaussian-distributed datastreams, and empirically demonstrate that this problem holds also on real-world datasets and that can be harmful even at low data-dimensions (say, 10).

DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN Machine Learning

Recently, the introduction of the generative adversarial network (GAN) and its variants has enabled the generation of realistic synthetic samples, which has been used for enlarging training sets. Previous work primarily focused on data augmentation for semi-supervised and supervised tasks. In this paper, we instead focus on unsupervised anomaly detection and propose a novel generative data augmentation framework optimized for this task. In particular, we propose to oversample infrequent normal samples - normal samples that occur with small probability, e.g., rare normal events. We show that these samples are responsible for false positives in anomaly detection. However, oversampling of infrequent normal samples is challenging for real-world high-dimensional data with multimodal distributions. To address this challenge, we propose to use a GAN variant known as the adversarial autoencoder (AAE) to transform the high-dimensional multimodal data distributions into low-dimensional unimodal latent distributions with well-defined tail probability. Then, we systematically oversample at the `edge' of the latent distributions to increase the density of infrequent normal samples. We show that our oversampling pipeline is a unified one: it is generally applicable to datasets with different complex data distributions. To the best of our knowledge, our method is the first data augmentation technique focused on improving performance in unsupervised anomaly detection. We validate our method by demonstrating consistent improvements across several real-world datasets.

Tutorial on Outlier Detection in Python using the PyOD Library


My latest data science project involved predicting the sales of each product in a particular store. There were several ways I could approach the problem. But no matter which model I used, my accuracy score would not improve. I figured out the problem after spending some time inspecting the data – outliers! This is a commonly overlooked mistake we tend to make.