Goto

Collaborating Authors

 Performance Analysis


Classifier Suites for Insider Threat Detection

arXiv.org Machine Learning

Better methods to detect insider threats need new anticipatory analytics to capture risky behavior prior to losing data. In search of the best overall classifier, this work empirically scores 88 machine learning algorithms in 16 major families. We extract risk features from the large CERT dataset, which blends real network behavior with individual threat narratives. We discover the predictive importance of measuring employee sentiment. Among major classifier families tested on CERT, the random forest algorithms offer the best choice, with different implementations scoring over 98% accurate. In contrast to more obscure or black-box alternatives, random forests are ensembles of many decision trees and thus offer a deep but human-readable set of detection rules (>2000 rules). We address performance rankings by penalizing long execution times against higher median accuracies using cross-fold validation. We address the relative rarity of threats as a case of low signal-to-noise (< 0.02% malicious to benign activities), and then train on both under-sampled and over-sampled data which is statistically balanced to identify nefarious actors.


Forget Finding Nemo: This AI can identify a single zebrafish out of a 100-strong shoal

#artificialintelligence

AI systems excel in pattern recognition, so much so that they can stalk individual zebrafish and fruit flies even when the animals are in groups of up to a hundred. To demonstrate this, a group of researchers from the Champalimaud Foundation, a private biomedical research lab in Portugal, trained two convolutional neural networks to identify and track individual animals within a group. The aim is not so much to match or exceed humans' ability to spot and follow stuff, but rather to automate the process of studying the behavior of animals in their communities. "The ultimate goal of our team is understanding group behavior," said Gonzalo de Polavieja. "We want to understand how animals in a group decide together and learn together."


Machine Learning Makes Amazon Alexa Smarter, Reduces Error Rate By 8%

#artificialintelligence

Alexa, Amazon's poster child for connected home devices, has recently received yet another update. Researchers have developed a method of implementing improved natural language processing in the model, cutting the error rate by 8%. This was done through a combination of transfer learning and utilising AI to generate'embeddings' of words. Transfer learning was implemented with a neural network to train an AI on a large dataset of annotated speech samples, thus enabling researchers to bootstrap training in a new domain with sparse data. This technique taps into millions of unnannotated interactions with Alexa, fueled by data from users of Amazon's Echo products.


A semi-supervised approach to message stance classification

arXiv.org Machine Learning

Social media communications are becoming increasingly prevalent; some useful, some false, whether unwittingly or maliciously. An increasing number of rumours daily flood the social networks. Determining their veracity in an autonomous way is a very active and challenging field of research, with a variety of methods proposed. However, most of the models rely on determining the constituent messages' stance towards the rumour, a feature known as the "wisdom of the crowd". Although several supervised machine-learning approaches have been proposed to tackle the message stance classification problem, these have numerous shortcomings. In this paper we argue that semi-supervised learning is more effective than supervised models and use two graph-based methods to demonstrate it. This is not only in terms of classification accuracy, but equally important, in terms of speed and scalability. We use the Label Propagation and Label Spreading algorithms and run experiments on a dataset of 72 rumours and hundreds of thousands messages collected from Twitter. We compare our results on two available datasets to the state-of-the-art to demonstrate our algorithms' performance regarding accuracy, speed and scalability for real-time applications.


Learning Choice Functions

arXiv.org Machine Learning

We study the problem of learning choice functions, which play an important role in various domains of application, most notably in the field of economics. Formally, a choice function is a mapping from sets to sets: Given a set of choice alternatives as input, a choice function identifies a subset of most preferred elements. Learning choice functions from suitable training data comes with a number of challenges. For example, the sets provided as input and the subsets produced as output can be of any size. Moreover, since the order in which alternatives are presented is irrelevant, a choice function should be symmetric. Perhaps most importantly, choice functions are naturally context-dependent, in the sense that the preference in favor of an alternative may depend on what other options are available. We formalize the problem of learning choice functions and present two general approaches based on two representations of context-dependent utility functions. Both approaches are instantiated by means of appropriate neural network architectures, and their performance is demonstrated on suitable benchmark tasks.


Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions

arXiv.org Machine Learning

When the average performance of a prediction model varies significantly with respect to a sensitive attribute (e.g., race or gender), the performance disparity can be expressed in terms of the probability distributions of input and output variables for each sensitive group. In this paper, we exploit this fact to explain and repair the performance disparity of a fixed classification model over a population of interest. Given a black-box classifier that performs unevenly across sensitive groups, we aim to eliminate the performance gap by perturbing the distribution of input features for the disadvantaged group. We refer to the perturbed distribution as a counterfactual distribution, and characterize its properties for popular fairness criteria (e.g., predictive parity, equal FPR, equal opportunity). We then design a descent algorithm to efficiently learn a counterfactual distribution given the black-box classifier and samples drawn from the underlying population. We use the estimated counterfactual distribution to build a data preprocessor that reduces disparate impact without training a new model. We illustrate these use cases through experiments on real-world datasets, showing that we can repair different kinds of disparate impact without a large drop in accuracy.


Improved Adversarial Learning for Fair Classification

arXiv.org Machine Learning

Motivated by concerns that machine learning algorithms may introduce significant bias in classification models, developing fair classifiers has become an important problem in machine learning research. One important paradigm towards this has been providing algorithms for adversarially learning fair classifiers (Zhang et al., 2018; Madras et al., 2018). We formulate the adversarial learning problem as a multi-objective optimization problem and find the fair model using gradient descent-ascent algorithm with a modified gradient update step, inspired by the approach of Zhang et al., 2018. We provide theoretical insight and guarantees that formalize the heuristic arguments presented previously towards taking such an approach. We test our approach empirically on the Adult dataset and synthetic datasets and compare against state of the art algorithms (Celis et al., 2018; Zhang et al., 2018; Zafar et al., 2017). The results show that our models and algorithms have comparable or better accuracy than other algorithms while performing better in terms of fairness, as measured using statistical rate or false discovery rate.


Limitations of Assessing Active Learning Performance at Runtime

arXiv.org Machine Learning

Classification algorithms aim to predict an unknown label (e.g., a quality class) for a new instance (e.g., a product). Therefore, training samples (instances and labels) are used to deduct classification hypotheses. Often, it is relatively easy to capture instances but the acquisition of the corresponding labels remain difficult or expensive. Active learning algorithms select the most beneficial instances to be labeled to reduce cost. In research, this labeling procedure is simulated and therefore a ground truth is available. But during deployment, active learning is a one-shot problem and an evaluation set is not available. Hence, it is not possible to reliably estimate the performance of the classification system during learning and it is difficult to decide when the system fulfills the quality requirements (stopping criteria). In this article, we formalize the task and review existing strategies to assess the performance of an actively trained classifier during training. Furthermore, we identified three major challenges: 1)~to derive a performance distribution, 2)~to preserve representativeness of the labeled subset, and 3) to correct against sampling bias induced by an intelligent selection strategy. In a qualitative analysis, we evaluate different existing approaches and show that none of them reliably estimates active learning performance stating a major challenge for future research for such systems. All plots and experiments are provided in a Jupyter notebook that is available for download.


Bayes Imbalance Impact Index: A Measure of Class Imbalanced Dataset for Classification Problem

arXiv.org Machine Learning

Recent studies have shown that imbalance ratio is not the only cause of the performance loss of a classifier in imbalanced data classification. In fact, other data factors, such as small disjuncts, noises and overlapping, also play the roles in tandem with imbalance ratio, which makes the problem difficult. Thus far, the empirical studies have demonstrated the relationship between the imbalance ratio and other data factors only. To the best of our knowledge, there is no any measurement about the extent of influence of class imbalance on the classification performance of imbalanced data. Further, it is also unknown for a dataset which data factor is actually the main barrier for classification. In this paper, we focus on Bayes optimal classifier and study the influence of class imbalance from a theoretical perspective. Accordingly, we propose an instance measure called Individual Bayes Imbalance Impact Index ($IBI^3$) and a data measure called Bayes Imbalance Impact Index ($BI^3$). $IBI^3$ and $BI^3$ reflect the extent of influence purely by the factor of imbalance in terms of each minority class sample and the whole dataset, respectively. Therefore, $IBI^3$ can be used as an instance complexity measure of imbalance and $BI^3$ is a criterion to show the degree of how imbalance deteriorates the classification. As a result, we can therefore use $BI^3$ to judge whether it is worth using imbalance recovery methods like sampling or cost-sensitive methods to recover the performance loss of a classifier. The experiments show that $IBI^3$ is highly consistent with the increase of prediction score made by the imbalance recovery methods and $BI^3$ is highly consistent with the improvement of F1 score made by the imbalance recovery methods on both synthetic and real benchmark datasets.


SteganoGAN: High Capacity Image Steganography with GANs

arXiv.org Machine Learning

Image steganography is a procedure for hiding messages inside pictures. While other techniques such as cryptography aim to prevent adversaries from reading the secret message, steganography aims to hide the presence of the message itself. In this paper, we propose a novel technique for hiding arbitrary binary data in images using generative adversarial networks which allow us to optimize the perceptual quality of the images produced by our model. We show that our approach achieves state-of-the-art payloads of 4.4 bits per pixel, evades detection by steganalysis tools, and is effective on images from multiple datasets. To enable fair comparisons, we have released an open source library that is available online at https://github.com/DAI-Lab/SteganoGAN.