Accuracy
Detecting Cyberattack Entities from Audit Data via Multi-View Anomaly Detection with Feedback
Siddiqui, Md Amran (Oregon State University) | Fern, Alan (Oregon State University) | Wright, Ryan (Galois, Inc.) | Theriault, Alec (Galois, Inc.) | Archer, David (Galois, Inc.) | Maxwell, William (Galois, Inc.)
In this paper, we consider the problem of detecting unknown cyberattacks from audit data of system-level events. A key challenge is that different cyberattacks will have different suspicion indicators, which are not known beforehand. To address this, we consider a multi-view anomaly detection framework, where multiple expert-designed "views" of the data are created to capture features that may serve as potential indicators. Anomaly detectors are then applied to each view, and the results are combined to yield an overall suspiciousness ranking of system entities. Unfortunately, there is often a mismatch between what anomaly detection algorithms find and what is actually malicious, which can result in many false positives. This problem is made even worse in the multi-view setting, where only a small subset of the views may be relevant to detecting a particular cyberattack. To help reduce the false positive rate, a key contribution of this paper is to incorporate feedback from security analysts about whether proposed suspicious entities are of interest or likely benign. This feedback is incorporated into subsequent anomaly detection in order to shift the suspiciousness ranking toward entities that are truly of interest to the analyst. For this purpose, we propose an easy-to-implement variant of the perceptron learning algorithm, which is shown to be quite effective on benchmark datasets. We evaluate our overall approach on real attack data from a DARPA red team exercise, which includes multiple attacks on multiple operating systems. The results show that the incorporation of feedback can significantly reduce the time required to identify malicious system entities.
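The feedback loop described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract only says that a perceptron variant is used, so the linear weighting of per-view scores, the additive update, the learning rate `lr`, and the names `rank` and `feedback_update` are all assumptions made for illustration.

```python
def rank(weights, view_scores):
    """Sort entity ids by the weighted sum of their per-view anomaly scores."""
    def total(entity):
        return sum(w * s for w, s in zip(weights, view_scores[entity]))
    return sorted(view_scores, key=total, reverse=True)


def feedback_update(weights, scores, is_malicious, lr=0.5):
    """Perceptron-style step (assumed form): move the weights toward the
    views that flagged a confirmed attack, away from views that flagged
    an entity the analyst marked benign."""
    sign = 1.0 if is_malicious else -1.0
    return [w + sign * lr * s for w, s in zip(weights, scores)]


# Hypothetical anomaly scores for three entities under two views.
view_scores = {
    "a": [0.9, 0.2],
    "b": [0.2, 0.8],
    "c": [0.8, 0.1],
}
weights = [1.0, 1.0]
top = rank(weights, view_scores)[0]          # entity shown to the analyst
weights = feedback_update(weights, view_scores[top], is_malicious=False)
print(rank(weights, view_scores))            # ranking after one "benign" label
```

In this toy run, marking the top entity as benign lowers the weight of the first view, so an entity that stands out in the second view moves to the top of the ranking.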
Adaptive Cost-sensitive Online Classification
Zhao, Peilin, Zhang, Yifan, Wu, Min, Hoi, Steven C. H., Tan, Mingkui, Huang, Junzhou
Cost-sensitive online classification has drawn extensive attention in recent years, where the main approach is to directly optimize, in an online manner, two well-known cost-sensitive metrics: (i) the weighted sum of sensitivity and specificity; (ii) the weighted misclassification cost. However, existing methods consider only first-order information of the data stream. This is insufficient in practice, since many recent studies have shown that incorporating second-order information enhances the prediction performance of classification models. Thus, we propose a family of cost-sensitive online classification algorithms with adaptive regularization in this paper. We theoretically analyze the proposed algorithms and empirically validate their effectiveness and properties in extensive experiments. Then, for a better trade-off between performance and efficiency, we further introduce the sketching technique into our algorithms, which significantly accelerates computation with only a slight performance loss. Finally, we apply our algorithms to several real-world online anomaly detection tasks. Promising results show that the proposed algorithms are effective and efficient in solving cost-sensitive online classification problems in various real-world domains.
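The two metrics named in this abstract can be computed as in the sketch below. The function names and the weight parameters (`eta_p`, `c_p`, `c_n`) and their default values are illustrative assumptions, not values taken from the paper. Labels are assumed to be +1 (positive) and -1 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    return tp, fn, tn, fp


def weighted_sum(y_true, y_pred, eta_p=0.5):
    """eta_p * sensitivity + (1 - eta_p) * specificity (higher is better)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return eta_p * sensitivity + (1.0 - eta_p) * specificity


def weighted_cost(y_true, y_pred, c_p=0.9, c_n=0.1):
    """c_p * (#false negatives) + c_n * (#false positives) (lower is better)."""
    _, fn, _, fp = confusion_counts(y_true, y_pred)
    return c_p * fn + c_n * fp


y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, 1]
print(weighted_sum(y_true, y_pred), weighted_cost(y_true, y_pred))
```

Setting `c_p` well above `c_n` expresses the usual anomaly-detection preference that a missed positive is more costly than a false alarm.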
Semi-Supervised Classification for oil reservoir
Li, Yanan, Guo, Haixiang, Paplinski, Andrew P
This paper addresses the general problem of accurately identifying oil reservoirs. Recent improvements in well or borehole logging technology have resulted in an explosion in the amount of data available for processing. The traditional approach, in which experts analyse the characteristics of the logs, requires a significant amount of time and money and is no longer practicable. In this paper, we use semi-supervised learning to address the ever-increasing amount of unlabelled data available for interpretation. Experts need to label only a small amount of the log data. The neural network classifier is first trained on the initial labelled data. Next, batches of unlabelled data are classified, and the samples with very high class probabilities are used in the next training session, bootstrapping the classifier. The process of training, classifying, and enhancing the labelled data is repeated iteratively until the stopping criterion is met, that is, until no more high-probability samples are found. We carry out an empirical study on well data from the Jianghan oil field and test the performance of the neural network semi-supervised classifier. We compare this method with other classifiers. The comparison results show that our neural network semi-supervised classifier is superior to the other classification methods.
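The train, classify, enhance loop described in this abstract can be sketched as follows. The paper uses a neural network; to stay self-contained, this sketch swaps in a toy one-dimensional nearest-centroid classifier. The function names, the 0.95 confidence threshold, and the round limit are assumptions for illustration.

```python
import math


def fit_centroids(xs, ys):
    """Toy stand-in classifier: one centroid per class on 1-D inputs."""
    centroids = {}
    for c in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == c]
        centroids[c] = sum(vals) / len(vals)
    return centroids


def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    w = {c: math.exp(-(x - m) ** 2) for c, m in centroids.items()}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}


def self_train(xs, ys, unlabelled, threshold=0.95, max_rounds=10):
    """Repeatedly pseudo-label confident samples and retrain; stop when a
    round finds no sample at or above the confidence threshold."""
    xs, ys, pool = list(xs), list(ys), list(unlabelled)
    model = fit_centroids(xs, ys)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for x in pool:
            probs = predict_proba(model, x)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break
        model = fit_centroids(xs, ys)
    return model, pool


model, leftover = self_train([0.0, 4.0], [0, 1], [0.5, 3.5, 2.0, 1.2])
print(model, leftover)
```

In this toy run the sample at 2.0 lies between the two classes and never reaches the threshold, so it stays unlabelled when the loop stops.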
Using a Classifier Ensemble for Proactive Quality Monitoring and Control: the impact of the choice of classifiers types, selection criterion, and fusion process
Thomas, Philippe, Haouzi, Hind Bril El, Suhner, Marie-Christine, Thomas, André, Zimmermann, Emmanuel, Noyel, Mélanie
In recent times, manufacturing processes have faced many external or internal changes (the increase of customized product rescheduling, process reliability, etc.). Monitoring and quality management activities for these manufacturing processes are therefore difficult, and managers need more proactive approaches to deal with this variability. In this study, we propose a proactive quality monitoring and control approach based on classifiers that predicts defect occurrences and provides optimal values for the factors critical to process quality. In a previous work (Noyel et al. 2013), the classification approach was used to improve the quality of a lacquering process at a company plant; the results obtained are promising, but the accuracy of the classification model used needs to be improved. One way to achieve this is to construct a committee of classifiers (referred to as an ensemble) to obtain a better predictive model than its constituent models. However, selecting the best classification methods and constructing the final ensemble remain challenging. In this study, we analyze the impact of the choice of classifier types on the accuracy of the classifier ensemble; in addition, we explore the effects of the selection criterion and the fusion process on ensemble accuracy. Several fusion scenarios were tested and compared on a real-world case. Our results show that using an ensemble increases the accuracy of the classifier models. Consequently, the monitoring and control of the considered real-world case can be improved.
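One simple fusion process for a classifier ensemble is majority voting, sketched below. The abstract does not say which fusion scenarios were tested, so majority voting is an assumed example, and the predictions are made-up toy data in which each classifier errs on a different sample.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists (all the same length) by majority."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


truth = [1, 1, 0, 0, 1, 0]
clf_a = [0, 1, 0, 0, 1, 0]   # wrong on sample 0
clf_b = [1, 0, 0, 0, 1, 0]   # wrong on sample 1
clf_c = [1, 1, 1, 0, 1, 0]   # wrong on sample 2

fused = majority_vote([clf_a, clf_b, clf_c])
print([accuracy(truth, c) for c in (clf_a, clf_b, clf_c)], accuracy(truth, fused))
```

The vote corrects each isolated error here only because the three classifiers make their mistakes on different samples. If their errors were correlated, the ensemble would gain much less, which is one reason the choice of classifier types matters.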
Adaptive Diffusions for Scalable Learning over Graphs
Berberidis, Dimitris, Nikolakopoulos, Athanasios N., Giannakis, Georgios B.
Diffusion-based classifiers, such as those relying on Personalized PageRank and the heat kernel, enjoy remarkable classification accuracy at modest computational requirements. Their performance, however, depends on the extent to which the chosen diffusion captures a typically unknown label propagation mechanism, which can be specific to the underlying graph and potentially different for each class. The present work introduces a disciplined, data-efficient approach to learning class-specific diffusion functions adapted to the underlying network topology. The novel learning approach leverages the notion of "landing probabilities" of class-specific random walks, which can be computed efficiently, thereby ensuring scalability to large graphs. This is supported by rigorous analysis of the properties of the model as well as of the proposed algorithms. Furthermore, a robust version of the classifier facilitates learning even in noisy environments. Classification tests on real networks demonstrate that adapting the diffusion function to the given graph and observed labels significantly improves performance over fixed diffusions, reaching, and many times surpassing, the classification accuracy of computationally heavier state-of-the-art competing methods that rely on node embeddings and deep neural networks.
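The "landing probabilities" mentioned in this abstract are the distributions of a random walk after 1, 2, ..., K steps when it starts from the labelled nodes of one class. A diffusion score is then a weighted combination of them. The sketch below shows only that construction on an unweighted, undirected toy graph. The coefficients `theta` are fixed by hand here, whereas the paper learns them, and all names are illustrative assumptions.

```python
def landing_probabilities(adj, seeds, num_steps):
    """Return [p_1, ..., p_K], where p_k[v] is the probability that a walk
    started uniformly at `seeds` is at node v after k steps.
    `adj` maps each node to its neighbour list (no isolated nodes)."""
    p = {v: 0.0 for v in adj}
    for s in seeds:
        p[s] = 1.0 / len(seeds)
    steps = []
    for _ in range(num_steps):
        nxt = {v: 0.0 for v in adj}
        for u, mass in p.items():
            if mass == 0.0:
                continue
            share = mass / len(adj[u])   # walk moves to a uniform neighbour
            for v in adj[u]:
                nxt[v] += share
        p = nxt
        steps.append(p)
    return steps


def diffuse(landing, theta):
    """Per-node score: sum over k of theta[k] * p_k[v]."""
    return {v: sum(t * p[v] for t, p in zip(theta, landing)) for v in landing[0]}


# Path graph 0 - 1 - 2 - 3 with node 0 as the only labelled seed.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
landing = landing_probabilities(adj, seeds=[0], num_steps=3)
print(diffuse(landing, theta=[0.5, 0.3, 0.2]))
```

Fixed diffusions correspond to particular choices of `theta`. Personalized PageRank uses geometrically decaying weights, and the heat kernel uses Poisson weights. Each step costs time proportional to the number of edges, which is why such scores scale to large graphs.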
WWE WrestleMania 34: Predictions, Match Card, Preview For 2018 PPV
WrestleMania 34 is WWE's biggest show of 2018 in more ways than one. Wrestling's Super Bowl should draw an attendance of around 75,000 and last somewhere between six and seven hours. Every single WWE championship will be on the line Sunday night in New Orleans, and 14 matches are expected to be on the card. Below are predictions for every WrestleMania 34 match. A lot has changed since Lesnar and Reigns fought for the world title in the WrestleMania main event three years ago.
Gaussian Process Subset Scanning for Anomalous Pattern Detection in Non-iid Data
Herlands, William, McFowland, Edward III, Wilson, Andrew Gordon, Neill, Daniel B.
Identifying anomalous patterns in real-world data is essential for understanding where, when, and how systems deviate from their expected dynamics. Yet methods that separately consider the anomalousness of each individual data point have low detection power for subtle, emerging irregularities. Additionally, recent detection techniques based on subset scanning make strong independence assumptions and suffer degraded performance on correlated data. We introduce methods for identifying anomalous patterns in non-iid data by combining Gaussian processes with a novel log-likelihood ratio statistic and subset scanning techniques. Our approaches are powerful and interpretable, and they can integrate information across multiple data streams. We illustrate their performance on numeric simulations and three open source spatiotemporal datasets of opioid overdose deaths, 311 calls, and storm reports.
Qualitätsmaße binärer Klassifikationen im Bereich kriminalprognostischer Instrumente der vierten Generation
This master's thesis discusses an important issue regarding how algorithmic decision making (ADM) is used in crime forecasting. In America, forecasting tools are widely used by the judiciary to make decisions about the risk posed by offenders. By making use of such tools, the judiciary relies on ADM in order to reach error-free judgements on offenders. For this purpose, one of the most widely used quality measures for machine learning techniques, the $AUC$ (area under the curve), is compared and contrasted with the $PPV_k$ (positive predictive value). Given the criticality of these judgements and the heavy dependence on tools offering ADM, it is necessary to evaluate risk tools that aid in decision making based on algorithms. In this methodology, such an evaluation is conducted using a common machine learning approach, the binary classifier, as it determines the binary outcome of the underlying juristic question. This thesis showed that the $PPV_k$ models the decisions of judges much better than the $AUC$. Therefore, this research investigated whether there exists a classifier for which the $PPV_k$ deviates from the $AUC$ by a large amount. It could be shown that the deviation can rise up to 0.75. In order to test this deviation on a classifier that is already in use, data from the fourth-generation risk assessment tool COMPAS was used. The results were quite alarming, as the two measures deviate from each other by 0.48. In this study, the risk assessment evaluation of the forecasting tools was successfully conducted, carefully reviewed and examined. Additionally, it is discussed whether systems used for the purpose of making such decisions should be socially accepted or not.
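The gap between the two measures compared in this abstract can be seen on a small ranking. In the sketch below, $AUC$ is computed as the fraction of (positive, negative) pairs that the scores order correctly, and $PPV_k$ as the share of true positives among the $k$ highest-scored cases. Choosing $k$ equal to the number of actual positives, along with the scores and labels themselves, is an assumption made for illustration and is not data from the thesis.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ppv_at_k(scores, labels, k):
    """Share of true positives among the k highest-scored cases."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Ten hypothetical cases; the highest-scored case is actually a negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
k = sum(labels)
print(auc(scores, labels), ppv_at_k(scores, labels, k))
```

A single negative ranked first costs little in $AUC$, because the positives still outrank most negatives, but it removes half of the top-$k$ hits. When a decision is taken only for the highest-ranked cases, the two measures can therefore tell quite different stories.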
Study Finds Consumer DNA Tests Wrong 40 Percent Of The Time
Popular direct-to-consumer DNA kits that promise to reveal a person's heritage and details about their health provide false information to two in five users, a new study published in the journal Genetics in Medicine suggests. The troubling data comes from research conducted by medical diagnostics company Ambry Genetics. The researchers found that consumer DNA tests, such as 23andMe, are prone to false positives that produce incorrect information. Consumer DNA tests often do not look at the entirety of an individual's genome. Instead, they use a technique that targets specific SNP arrays, which can reveal certain pieces of information about an individual, such as a predisposition to a disease.
Joint Optimization Framework for Learning with Noisy Labels
Tanaka, Daiki, Ikami, Daiki, Yamasaki, Toshihiko, Aizawa, Kiyoharu
Deep neural networks (DNNs) trained on large-scale datasets have exhibited strong performance in image classification. Many large-scale datasets are collected from websites; however, they tend to contain inaccurate labels, termed noisy labels. Training on such noisily labeled datasets degrades performance because DNNs easily overfit to noisy labels. To overcome this problem, we propose a joint optimization framework for learning DNN parameters and estimating true labels. Our framework can correct labels during training by alternately updating the network parameters and the labels. We conduct experiments on the noisy CIFAR-10 datasets and the Clothing1M dataset. The results indicate that our approach significantly outperforms other state-of-the-art methods.