Accuracy
Latent Laplacian Maximum Entropy Discrimination for Detection of High-Utility Anomalies
Hou, Elizabeth, Sricharan, Kumar, Hero, Alfred O.
Anomaly detection is a pervasive problem applicable to a variety of domains including network intrusion, fraud detection, and system failures. It is a crucial task in many applications because failure to detect anomalous activity could result in highly undesirable outcomes. For example, (i) detection of anomalous medical claims is important to identify fraud; (ii) detection of fraudulent credit card transactions is necessary to help prevent identity theft; and (iii) detection of abnormal network traffic is necessary to identify hacking. Many techniques have been developed for anomaly detection. These methods can be broadly classified into two categories: (i) rule-based systems, and (ii) statistical data-driven approaches. Rule-based systems rely on domain expertise and look for specific types of anomalies, while data-driven approaches identify anomalies by finding statistically rare patterns. Examples of data-driven methods include parametric methods that assume a known family for the nominal (non-anomalous) distribution and nonparametric methods such as those using unsupervised or semi-supervised support vector machines (SVMs) [1], [2] or based on minimum volume set estimation [3], [4], [5]. The advantage of data-driven approaches over rule-based methods is that they can identify novel types of anomalies that are unknown to the domain expert.
New Fairness Metrics for Recommendation that Embrace Differences
We study fairness in collaborative-filtering recommender systems, which are sensitive to discrimination that exists in historical data. Biased data can lead collaborative filtering methods to make unfair predictions against minority groups of users. We identify the insufficiency of existing fairness metrics and propose four new metrics that address different forms of unfairness. These fairness metrics can be optimized by adding fairness terms to the learning objective. Experiments on synthetic and real data show that our new metrics can better measure fairness than the baseline, and that the fairness objectives effectively help reduce unfairness.
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Rayhan, Farshid, Ahmed, Sajid, Mahbub, Asif, Jani, Md. Rafsan, Shatabda, Swakkhar, Farid, Dewan Md.
Classification of imbalanced data is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances usually represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling the majority class and over-sampling the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach with a boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We compared the performance of the CUSBoost algorithm against state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
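The idea of clustering-based under-sampling can be sketched as follows. This is a minimal illustration under stated assumptions, not the CUSBoost implementation: the abstract does not say how clusters are formed or how many majority instances are kept, so the plain k-means routine, the function names `kmeans` and `cluster_undersample`, and the `keep_fraction` parameter are all hypothetical. The boosting step is omitted.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay where they are.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels


def cluster_undersample(majority, k, keep_fraction, seed=0):
    """Keep a random share of the majority class from every cluster (assumed scheme)."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if not members:
            continue
        # Keep at least one point per non-empty cluster so no region disappears entirely.
        n_keep = max(1, round(keep_fraction * len(members)))
        kept.extend(rng.sample(members, n_keep))
    return kept
```

Sampling within clusters, rather than uniformly over the whole majority class, is meant to keep every region of the majority class represented after the reduction. The reduced majority set would then be combined with the untouched minority class before training a boosted classifier.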
Virtual Adversarial Ladder Networks For Semi-supervised Learning
Shinoda, Saki, Worrall, Daniel E., Brostow, Gabriel J.
Semi-supervised learning (SSL) partially circumvents the high cost of labeling data by augmenting a small labeled dataset with a large and relatively cheap unlabeled dataset drawn from the same distribution. This paper offers a novel interpretation of two deep learning-based SSL approaches, ladder networks and virtual adversarial training (VAT), as applying distributional smoothing to their respective latent spaces. We propose a class of models that fuse these approaches. We achieve near-supervised accuracy with high consistency on the MNIST dataset using just 5 labels per class: our best model, ladder with layer-wise virtual adversarial noise (LVAN-LW), achieves 1.42% +/- 0.12 average error rate on the MNIST test set, in comparison with 1.62% +/- 0.65 reported for the ladder network. On adversarial examples generated with L2-normalized fast gradient method, LVAN-LW trained with 5 examples per class achieves average error rate 2.4% +/- 0.3 compared to 68.6% +/- 6.5 for the ladder network and 9.9% +/- 7.5 for VAT.
Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Last, Felix, Douzas, Georgios, Bacao, Fernando
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
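For readers unfamiliar with SMOTE, its core interpolation step can be sketched as below. This is a hedged illustration, not the authors' released implementation: it covers only the generation of synthetic minority points between a sample and one of its nearest minority neighbours, and leaves out the k-means clustering stage that the abstract describes. The function name `smote_samples` and its parameters are hypothetical.

```python
import random


def smote_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))
        # Pick one of the k nearest neighbours and a random position on the segment.
        neighbour = rng.choice(others[:k])
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic
```

Because every synthetic point lies on a line segment between two existing minority points, the new data stays inside the region the minority class already occupies. Restricting the interpolation to points from the same cluster is one plausible way clustering could limit noisy samples, though the abstract does not spell out the mechanism.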
FDR-Corrected Sparse Canonical Correlation Analysis with Applications to Imaging Genomics
Gossmann, Alexej, Zille, Pascal, Calhoun, Vince, Wang, Yu-Ping
Reducing the number of false positive discoveries is presently one of the most pressing issues in the life sciences. It is of especially great importance for many applications in neuroimaging and genomics, where datasets are typically high-dimensional, which means that the number of explanatory variables exceeds the sample size. The false discovery rate (FDR) is a criterion that can be employed to address that issue. Thus it has gained great popularity as a tool for testing multiple hypotheses. Canonical correlation analysis (CCA) is a statistical technique that is used to make sense of the cross-correlation of two sets of measurements collected on the same set of samples (e.g., brain imaging and genomic data for the same mental illness patients), and sparse CCA extends the classical method to high-dimensional settings. Here we propose a way of applying the FDR concept to sparse CCA, and a method to control the FDR. The proposed FDR correction directly influences the sparsity of the solution, adapting it to the unknown true sparsity level. Theoretical derivation as well as simulation studies show that our procedure indeed keeps the FDR of the canonical vectors below a user-specified target level. We apply the proposed method to an imaging genomics dataset from the Philadelphia Neurodevelopmental Cohort. Our results link the brain connectivity profiles derived from brain activity during an emotion identification task, as measured by functional magnetic resonance imaging (fMRI), to the corresponding subjects' genomic data. Canonical correlation analysis (due to Hotelling, [1]), or CCA, is a classical statistical technique, which is used to make sense of the cross-correlation of two sets of measurements collected on the same set of samples. More precisely, given two sets of random variables, CCA identifies linear combinations of each, which have maximum correlation with each other.
The coefficients of these linear combinations of features are called canonical vectors. Like many classical statistical techniques, CCA fails in high-dimensional settings, when the number of variables in either of the two cross-correlated datasets exceeds the number of samples.
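The CCA objective described above can be written out as follows. The notation is an assumption introduced here for illustration and does not come from the source: x and y denote the two sets of random variables, with dimensions p and q, and the Sigma terms denote their covariance and cross-covariance matrices.

```latex
% Assumed notation: x \in \mathbb{R}^p and y \in \mathbb{R}^q are the two sets of variables;
% \Sigma_{xx}, \Sigma_{yy} are their covariance matrices, \Sigma_{xy} the cross-covariance.
(u^*, v^*)
  = \arg\max_{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}
      \operatorname{corr}\!\left(u^\top x,\; v^\top y\right)
  = \arg\max_{u,\, v}
      \frac{u^\top \Sigma_{xy}\, v}
           {\sqrt{u^\top \Sigma_{xx}\, u}\;\sqrt{v^\top \Sigma_{yy}\, v}}
```

In this notation u* and v* are the canonical vectors. When p or q exceeds the number of samples, the sample estimates of the covariance matrices are singular, which is one way to see why the classical problem breaks down in high-dimensional settings.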
Improving Malware Detection Accuracy by Extracting Icon Information
Silva, Pedro, Akhavan-Masouleh, Sepehr, Li, Li
Detecting PE malware files is now commonly approached using statistical and machine learning models. While these models commonly use features extracted from the structure of PE files, we propose that icons from these files can also help better predict malware. We propose an innovative machine learning approach to extract information from icons. Our proposed approach consists of two steps: 1) extracting icon features using summary statistics, histogram of gradients (HOG), and a convolutional autoencoder, 2) clustering icons based on the extracted icon features. Using publicly available data and machine learning experiments, we show that our proposed icon clusters significantly boost the efficacy of malware prediction models. In particular, our experiments show an average accuracy increase of 10% when icon clusters are used in the prediction model.
Capsule Network Performance on Complex Data
Xi, Edgar, Bing, Selina, Jin, Yang
In recent years, convolutional neural networks (CNNs) have played an important role in the field of deep learning. Variants of CNNs have proven to be very successful in classification tasks across different domains. However, CNNs have two big drawbacks: their failure to take into account important spatial hierarchies between features, and their lack of rotational invariance. As long as certain key features of an object are present in the test data, CNNs classify the test data as the object, disregarding the features' spatial orientation relative to each other. This causes false positives. The lack of rotational invariance in CNNs can cause the network to incorrectly assign the object another label, causing false negatives. To address these concerns, Hinton et al. propose a novel type of neural network using the concept of capsules in a recent paper. With the use of dynamic routing and reconstruction regularization, the capsule network model would be both rotation invariant and spatially aware. The capsule network has shown its potential by achieving a state-of-the-art result of 0.25% test error on MNIST without data augmentation such as rotation and scaling, better than the previous baseline of 0.39%. To further test the application of capsule networks on data with higher dimensionality, we attempt to find the set of configurations that yields the best test error on the CIFAR10 dataset.
Fintech trends: The rise of AI Fintech 2017 Recap
This article on 2017's AI comes very soon after Google's AutoML project created an AI child that was smarter than AI built by humans. The 'child AI', called NASNet, was created by two parent AIs and utilises 'reinforcement learning' that enables it to report, learn and improve from its parent AIs. Whilst we'd scheduled an AI recap for 2017, it seems that this has been the most significant development in AI technology this year. We've put together the interesting world of AI as told by our articles over the months of 2017. We'll look at the many applications of AI, what AI is and where it's going.