Accuracy
Using Machine Learning to Discern Eruption in Noisy Environments: A Case Study using CO2-driven Cold-Water Geyser in Chimayo, New Mexico
Yuan, B., Tan, Y. J., Mudunuru, M. K., Marcillo, O. E., Delorey, A. A., Roberts, P. M., Webster, J. D., Gammans, C. N. L., Karra, S., Guthrie, G. D., Johnson, P. A.
We present an approach based on machine learning (ML) to distinguish eruption and precursory signals of Chimayó geyser (New Mexico, USA) in noisy environments. This geyser can be considered a natural analog of CO2 intrusion into shallow water aquifers. By studying this geyser, we can understand the upwelling of CO2-rich fluids from depth, which is relevant to leak monitoring in a CO2 sequestration project. ML methods such as Random Forests (RF) are known to be robust multi-class classifiers that perform well under unfavorable noisy conditions. However, the extent of the RF method's accuracy is poorly understood for this CO2-driven geysering application. The current study aims to quantify the performance of RF classifiers in discerning the geyser state. Towards this goal, we first present the data collected from the seismometer installed near the Chimayó geyser. The seismic signals collected at this site contain different types of noise, such as daily temperature variations, seasonal trends, animal movement near the geyser, and human activity. We first filter this noise out of the signals by combining the Butterworth highpass filter and an autoregressive method in a multi-level fashion. We show that combining these filtering techniques in a hierarchical fashion reduces the noise in the seismic data without removing the precursor and eruption event signals. We then use RF on the filtered data to classify the state of the geyser into three classes: remnant noise, precursor, and eruption. We show that the classification accuracy using RF on the filtered data is greater than 90%. These aspects make the proposed ML framework attractive for event discrimination and signal enhancement under noisy conditions, with strong potential for application to monitoring leaks in CO2 sequestration.
Classification from Positive, Unlabeled and Biased Negative Data
Hsieh, Yu-Guan, Niu, Gang, Sugiyama, Masashi
In conventional binary classification, examples are labeled as either positive (P) or negative (N), and we train a classifier on these labeled examples. In contrast, positive-unlabeled (PU) learning addresses the problem of learning a classifier from P and unlabeled (U) data, without the need to explicitly identify N data (Elkan & Noto, 2008; Ward et al., 2009). PU learning is useful in many real-world problems. For example, in one-class remote sensing classification (Li et al., 2011), we seek to extract a specific land-cover class from an image. While it is easy to label examples of this specific land-cover class of interest, examples not belonging to this class are too diverse to be exhaustively annotated. The same problem arises in text classification, as it is difficult or even impossible to compile a set of N samples that provides a comprehensive characterization of everything that is not in the P class (Liu et al., 2003; Fung et al., 2006). PU learning has also been applied to other domains such as outlier detection (Hido et al., 2008; Scott & Blanchard, 2009), medical diagnosis (Zuluaga et al., 2011), and time series classification (Nguyen et al., 2011). Examining the above examples carefully, we find that the most difficult step is often collecting a fully representative N set, whereas labeling only a small portion of all possible N data is relatively easy. Therefore, in this paper, we propose to study the problem of learning from P, U and biased N (bN) data, which we name PUbN learning hereinafter.
One-Click Annotation with Guided Hierarchical Object Detection
Subramanian, Adithya, Subramanian, Anbumani
The increase in data collection has made data annotation an interesting and valuable task in the contemporary world. This paper presents a new methodology for quickly annotating data using click-supervision and hierarchical object detection. The proposed work is semi-automatic in nature: the task of annotation is split between a human and a neural network. We show that our improved method of annotation reduces the time, cost and mental stress on a human annotator. The research also highlights how our method performs better than the current approach in different circumstances, such as variation in the number of objects, object size and dataset. Our approach also proposes a new way of using object detectors that makes them suitable for data annotation tasks. The experiment conducted on the PASCAL VOC dataset revealed that annotations created with our approach achieve a mAP of 0.995 and a recall of 0.903. Our approach has shown an overall improvement of 8.5% and 18.6% in mean average precision and recall, respectively, for the KITTI dataset, and of 69.6% and 36% for the CITYSCAPES dataset. The proposed framework is 3-4 times faster than the standard annotation method.
Artificial Intelligence Enabled Software Defined Networking: A Comprehensive Overview
Software defined networking (SDN) represents a promising networking architecture that combines central management and network programmability. SDN separates the control plane from the data plane and moves the network management to a central point, called the controller, that can be programmed and used as the brain of the network. Recently, the research community has shown an increased tendency to draw on recent advances in the artificial intelligence (AI) field to provide learning abilities and better decision making in SDN. In this study, we provide a detailed overview of the recent efforts to include AI in SDN. Our study showed that the research efforts focused on three main sub-fields of AI, namely machine learning, meta-heuristics and fuzzy inference systems. Accordingly, in this work we investigate their different application areas and potential use, as well as the improvements achieved by including AI-based techniques in the SDN paradigm.
Why it's hard to design fair machine learning models
In this episode of the Data Show, I spoke with Sharad Goel, assistant professor at Stanford, and his student Sam Corbett-Davies. They recently wrote a survey paper, "A Critical Review of Fair Machine Learning," where they carefully examined the standard statistical tools used to check for fairness in machine learning models. It turns out that each of the standard approaches (anti-classification, classification parity, and calibration) has limitations, and their paper is a must-read tour through recent research in designing fair algorithms. We talked about their key findings, and, most importantly, I pressed them to list a few best practices that analysts and industrial data scientists might want to consider. Sam Corbett-Davies: The problem with many of the standard metrics is that they fail to take into account how different groups might have different distributions of risk.
ML Metrics: Sensitivity vs. Specificity - DZone AI
In this post, we will try to understand the concepts behind evaluation metrics such as sensitivity and specificity, which are used to determine the performance of machine learning models. The post also describes the differences between sensitivity and specificity. The concepts are explained using a model that predicts whether or not a person is suffering from a disease. Sensitivity is a measure of the proportion of actual positive cases that got predicted as positive (or true positive). Sensitivity is also termed Recall.
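As a minimal sketch of these two metrics: sensitivity is TP / (TP + FN), and specificity is the matching quantity for the negative class, TN / (TN + FP). The function name `sensitivity_specificity` and the label lists below are hypothetical and chosen only for illustration; they do not come from the post.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, predicted sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, predicted healthy
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, predicted healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, predicted sick
    sensitivity = tp / (tp + fn)  # share of actual positives that were caught
    specificity = tn / (tn + fp)  # share of actual negatives that were cleared
    return sensitivity, specificity


# Hypothetical labels: 4 people with the disease, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

With these made-up labels the model catches 3 of 4 actual positives and clears 4 of 6 actual negatives, so the two numbers can differ substantially for the same model.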
What Really Matters with Machine Learning
History will look back on our time as the beginning of the artificial intelligence revolution. In 2017, artificial intelligences are beating us at Go, translating and inventing their own languages, helping us decide what to buy, writing for us, and composing music. Neural networks can even be used for image compression! As you might expect, the endpoint security industry is benefiting greatly from AI as we are using it for everything, from detecting threats to unusual network activity. However, sometimes the problem with a complex, new technology, apart from actually inventing and building it, is figuring out how to explain it to customers -- how does it work and why is it valuable.
Using Confusion Matrices to Quantify the Cost of Being Wrong
There are so many confusing and sometimes even counter-intuitive concepts in statistics. I mean, come on…even explaining the differences between Null Hypothesis and Alternative Hypothesis can be an ordeal. All I want to do is to understand and quantify the cost of my analytical models being wrong. For example, let's say that I'm a shepherd who has bad eyesight and has a hard time distinguishing between a wolf and a sheep dog. That's obviously a bad trait, because the costs of being wrong are very high: Okay, so I'm not a very good shepherd, but I am a very sophisticated shepherd and I've built a Neural Network application to distinguish a sheep dog from a wolf.
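One way to put a number on "the cost of being wrong" is to weight each cell of the confusion matrix by a cost and add them up. The sketch below does that for the wolf versus sheep dog example. All counts and cost figures are invented for illustration, and the helper `total_cost` is a hypothetical name, not something from the article.

```python
def total_cost(confusion, cost):
    """Sum count * unit cost over every (actual, predicted) cell."""
    return sum(count * cost.get(cell, 0.0) for cell, count in confusion.items())


# Hypothetical confusion matrix, keyed as (actual, predicted).
confusion = {
    ("wolf", "wolf"): 8,
    ("wolf", "dog"): 2,   # missed wolf: sheep get eaten
    ("dog", "wolf"): 5,   # false alarm: the dog gets chased off
    ("dog", "dog"): 85,
}

# Hypothetical unit costs; correct predictions are assumed to cost nothing.
cost = {
    ("wolf", "dog"): 500.0,
    ("dog", "wolf"): 20.0,
}

cost_total = total_cost(confusion, cost)
correct = confusion[("wolf", "wolf")] + confusion[("dog", "dog")]
accuracy = correct / sum(confusion.values())
print(f"accuracy={accuracy:.2f} total cost={cost_total:.0f}")
```

Under these assumed numbers, two missed wolves account for most of the total cost even though accuracy is 93%, which is the kind of asymmetry that accuracy alone hides.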
Cost-Sensitive Learning for Predictive Maintenance
Spiegel, Stephan, Mueller, Fabian, Weismann, Dorothea, Bird, John
In predictive maintenance, model performance is usually assessed by means of precision, recall, and F1-score. However, employing the model with the best performance, e.g. the highest F1-score, does not necessarily result in minimum maintenance cost, but can instead lead to additional expenses. Thus, we propose to perform model selection based on the economic costs associated with the particular maintenance application. We show that cost-sensitive learning for predictive maintenance can result in significant cost reduction and fault-tolerant policies, since it allows various business constraints and requirements to be incorporated.
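A small sketch of why the model with the highest F1-score need not be the cheapest one to deploy. The candidate models, their confusion counts and the cost figures are all hypothetical, and the simple linear cost function is an assumption made for illustration, not the cost model used by the authors.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def maintenance_cost(tp, fp, fn, cost_tp, cost_fp, cost_fn):
    """Assumed linear cost: planned repairs, needless inspections, missed faults."""
    return tp * cost_tp + fp * cost_fp + fn * cost_fn


# Hypothetical confusion counts for two candidate models.
candidates = {
    "model_a": {"tp": 40, "fp": 10, "fn": 10},
    "model_b": {"tp": 48, "fp": 30, "fn": 2},
}

# Hypothetical unit costs: a missed fault is far more expensive than a false alarm.
COSTS = {"cost_tp": 1.0, "cost_fp": 1.0, "cost_fn": 20.0}

best_by_f1 = max(candidates, key=lambda m: f1_score(**candidates[m]))
best_by_cost = min(
    candidates, key=lambda m: maintenance_cost(**candidates[m], **COSTS)
)

for name, counts in candidates.items():
    print(name, round(f1_score(**counts), 3), maintenance_cost(**counts, **COSTS))
print("best by F1:", best_by_f1, "| best by cost:", best_by_cost)
```

With these assumed costs, `model_a` has the higher F1-score (0.80 versus 0.75) while `model_b` is cheaper overall (118 versus 250), because it misses fewer faults at the price of more false alarms.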