Goto

Collaborating Authors

 Accuracy


A One-Sided Classification Toolkit with Applications in the Analysis of Spectroscopy Data

arXiv.org Machine Learning

This dissertation investigates the use of one-sided classification algorithms in the application of separating hazardous chlorinated solvents from other materials, based on their Raman spectra. The experimentation is carried out using a new one-sided classification toolkit that was designed and developed from the ground up. In the one-sided classification paradigm, the objective is to separate elements of the target class from all outliers. These one-sided classifiers are generally chosen, in practice, when there is a deficiency of some sort in the training examples. Sometimes outlier examples can be rare, expensive to label, or even entirely absent. However, this author would like to note that they can be equally applicable when outlier examples are plentiful but nonetheless not statistically representative of the complete outlier concept. It is this scenario that is explicitly dealt with in this research work. In these circumstances, one-sided classifiers have been found to be more robust that conventional multi-class classifiers. The term "unexpected" outliers is introduced to represent outlier examples, encountered in the test set, that have been taken from a different distribution to the training set examples. These are examples that are a result of an inadequate representation of all possible outliers in the training set. It can often be impossible to fully characterise outlier examples given the fact that they can represent the immeasurable quantity of "everything else" that is not a target. The findings from this research have shown the potential drawbacks of using conventional multi-class classification algorithms when the test data come from a completely different distribution to that of the training samples.


Partial AUC Maximization via Nonlinear Scoring Functions

arXiv.org Machine Learning

We propose a method for maximizing a partial area under a receiver operating characteristic (ROC) curve (pAUC) for binary classification tasks. In binary classification tasks, accuracy is the most commonly used as a measure of classifier performance. In some applications such as anomaly detection and diagnostic testing, accuracy is not an appropriate measure since prior probabilties are often greatly biased. Although in such cases the pAUC has been utilized as a performance measure, few methods have been proposed for directly maximizing the pAUC. This optimization is achieved by using a scoring function. The conventional approach utilizes a linear function as the scoring function. In contrast we newly introduce nonlinear scoring functions for this purpose. Specifically, we present two types of nonlinear scoring functions based on generative models and deep neural networks. We show experimentally that nonlinear scoring fucntions improve the conventional methods through the application of a binary classification of real and bogus objects obtained with the Hyper Suprime-Cam on the Subaru telescope.


Is your AML solution effective?

#artificialintelligence

In the financial services industry, innovations such as faster payments, together with global regulations and regulatory bodies1 raise a significant question: Do existing risk-rule-based money laundering detection systems really work? Financial institutions may also wonder about the effectiveness of fraud detection, the danger of false positives, structural costs, and the impact on customer transaction experience. According to Gartner, "if money laundering were an economy, it would be the fifth largest in the world." Strikingly, the UN also estimates that the amount of money laundered annually is equivalent to 2-5% of the global GDP. Another report from the Heritage Foundation suggests that complying with anti-money laundering (AML) rules, which require hiring the right talent needed to file suspicious activity reports, costs U.S. companies alone as much as $8 billion a year.


Temenos launches AI-based Suspicious Activity Prevention to combat fraud in real-time - The Fintech Times

#artificialintelligence

Temenos (SIX:TEMN), the banking software company, expanded its financial crime mitigation product to include an AI-based Suspicious Activity Prevention solution protecting banks and their customers from fraud. Demand for financial compliance and the increasing levels of financial crime are putting huge pressure on banks. Their legacy processes have grown so complex with a high level of manual work for screening alerts and other fraud mitigation activities. Each manual step is inefficient and prone to errors. High level of false positive rates exacerbate this problem.


A Fast and Easy Regression Technique for k-NN Classification Without Using Negative Pairs

arXiv.org Machine Learning

This paper proposes an inexpensive way to learn an effective dissimilarity function to be used for $k$-nearest neighbor ($k$-NN) classification. Unlike Mahalanobis metric learning methods that map both query (unlabeled) objects and labeled objects to new coordinates by a single transformation, our method learns a transformation of labeled objects to new points in the feature space whereas query objects are kept in their original coordinates. This method has several advantages over existing distance metric learning methods: (i) In experiments with large document and image datasets, it achieves $k$-NN classification accuracy better than or at least comparable to the state-of-the-art metric learning methods. (ii) The transformation can be learned efficiently by solving a standard ridge regression problem. For document and image datasets, training is often more than two orders of magnitude faster than the fastest metric learning methods tested. This speed-up is also due to the fact that the proposed method eliminates the optimization over "negative" object pairs, i.e., objects whose class labels are different. (iii) The formulation has a theoretical justification in terms of reducing hubness in data.


Accurate and Robust Neural Networks for Security Related Applications Exampled by Face Morphing Attacks

arXiv.org Artificial Intelligence

Artificial neural networks tend to learn only what they need for a task. A manipulation of the training data can counter this phenomenon. In this paper, we study the effect of different alterations of the training data, which limit the amount and position of information that is available for the decision making. We analyze the accuracy and robustness against semantic and black box attacks on the networks that were trained on different training data modifications for the particular example of morphing attacks. A morphing attack is an attack on a biometric facial recognition system where the system is fooled to match two different individuals with the same synthetic face image. Such a synthetic image can be created by aligning and blending images of the two individuals that should be matched with this image.


ActiveRemediation: The Search for Lead Pipes in Flint, Michigan

arXiv.org Machine Learning

We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents' drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide.


Reconstructing networks with unknown and heterogeneous errors

arXiv.org Machine Learning

The vast majority of network datasets contains errors and omissions, although this is rarely incorporated in traditional network analysis. Recently, an increasing effort has been made to fill this methodological gap by developing network reconstruction approaches based on Bayesian inference. These approaches, however, rely on assumptions of uniform error rates and on direct estimations of the existence of each edge via repeated measurements, something that is currently unavailable for the majority of network data. Here we develop a Bayesian reconstruction approach that lifts these limitations by not only allowing for heterogeneous errors, but also for individual edge measurements without direct error estimates. Our approach works by coupling the inference approach with structured generative network models, which enable the correlations between edges to be used as reliable error estimates. Although our approach is general, we focus on the stochastic block model as the basic generative process, from which efficient nonparametric inference can be performed, and yields a principled method to infer hierarchical community structure from noisy data. We demonstrate the efficacy of our approach with a variety of empirical and artificial networks.


A Taxonomy and Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets

arXiv.org Artificial Intelligence

With the world moving towards being increasingly dependent on computers and automation, one of the main challenges in the current decade has been to build secure applications, systems and networks. Alongside these challenges, the number of threats is rising exponentially due to the attack surface increasing through numerous interfaces offered for each service. To alleviate the impact of these threats, researchers have proposed numerous solutions; however, current tools often fail to adapt to ever-changing architectures, associated threats and 0-days. This manuscript aims to provide researchers with a taxonomy and survey of current dataset composition and current Intrusion Detection Systems (IDS) capabilities and assets. These taxonomies and surveys aim to improve both the efficiency of IDS and the creation of datasets to build the next generation IDS as well as to reflect networks threats more accurately in future datasets. To this end, this manuscript also provides a taxonomy and survey or network threats and associated tools. The manuscript highlights that current IDS only cover 25% of our threat taxonomy, while current datasets demonstrate clear lack of real-network threats and attack representation, but rather include a large number of deprecated threats, hence limiting the accuracy of current machine learning IDS. Moreover, the taxonomies are open-sourced to allow public contributions through a Github repository.


Data-driven model for the identification of the rock type at a drilling bit

arXiv.org Machine Learning

In order to bridge the gap of more than 15m between the drilling bit and high-fidelity rock type sensors during the directional drilling, we present a novel approach for identifying rock type at the drilling bit. The approach is based on application of machine learning techniques for Measurements While Drilling (MWD) data. We demonstrate capabilities of the developed approach for distinguishing between the rock types corresponding to (1) a target oil bearing interval of a reservoir and (2) a non-productive shale layer and compare it to more traditional physics-driven approaches. The dataset includes MWD data and lithology mapping along multiple wellbores obtained by processing of Logging While Drilling (LWD) measurements from a massive drilling effort on one of the major newly developed oilfield in the North of Western Siberia. We compare various machine-learning algorithms, examine extra features coming from physical modeling of drilling mechanics, and show that the classification error can be reduced from 13.5% to 9%.