Goto

Collaborating Authors

 Performance Analysis


Facility Locations Utility for Uncovering Classifier Overconfidence

arXiv.org Machine Learning

Assessing the predictive accuracy of black box classifiers is challenging in the absence of labeled test datasets. In these scenarios we may need to rely on a human oracle to evaluate individual predictions; presenting the challenge to create query algorithms to guide the search for points that provide the most information about the classifier's predictive characteristics. Previous works have focused on developing utility models and query algorithms for discovering unknown unknowns --- misclassifications with a predictive confidence above some arbitrary threshold. However, if misclassifications occur at the rate reflected by the confidence values, then these search methods reveal nothing more than a proper assessment of predictive certainty. We are unable to properly mitigate the risks associated with model deficiency when the model's confidence in prediction exceeds the actual model accuracy. We propose a facility locations utility model and corresponding greedy query algorithm that instead searches for overconfident unknown unknowns. Through robust empirical experiments we demonstrate that the greedy query algorithm with the facility locations utility model consistently results in oracle queries with superior performance in discovering overconfident unknown unknowns than previous methods.


Fast Construction of Correcting Ensembles for Legacy Artificial Intelligence Systems: Algorithms and a Case Study

arXiv.org Artificial Intelligence

This paper presents a technology for simple and computationally efficient improvements of a generic Artificial Intelligence (AI) system, including Multilayer and Deep Learning neural networks. The improvements are, in essence, small network ensembles constructed on top of the existing AI architectures. Theoretical foundations of the technology are based on Stochastic Separation Theorems and the ideas of the concentration of measure. We show that, subject to mild technical assumptions on statistical properties of internal signals in the original AI system, the technology enables instantaneous and computationally efficient removal of spurious and systematic errors with probability close to one on the datasets which are exponentially large in dimension. The method is illustrated with numerical examples and a case study of ten digits recognition from American Sign Language.


Security and Privacy considerations in Artificial Intelligence & Machine Learning -- Part 4: Theโ€ฆ

#artificialintelligence

Note: This is part-4 of a series of articles on'Security and Privacy in Artificial Intelligence & Machine Learning'. In this article we will take a closer look at use of AI&ML in various security-related use cases. We will cover not only cybersecurity but also some general security scenarios and how solutions based on AI&ML are becoming increasingly prevalent in all such areas. Towards the end we will also explore ways that attackers are likely to circumvent these security techniques. So let us begin with a look at some interesting areas where security features are benefiting from ML&AI. When we consider cybersecurity, one of the most common areas where AI&ML gets a mention is addressing the'needle in the haystack' problem in security of a large scale environment.


A gentle introduction to decision trees using R

#artificialintelligence

Most techniques of predictive analytics have their origins in probability or statistical theory (see my post on Naรฏve Bayes, for example). In this post I'll look at one that has more a commonplace origin: the way in which humans make decisions. When making decisions, we typically identify the options available and then evaluate them based on criteria that are important to us. The intuitive appeal of such a procedure is in no small measure due to the fact that it can be easily explained through a visual. The tree structure depicted here provides a neat, easy-to-follow description of the issue under consideration and its resolution.


MDGAN: Boosting Anomaly Detection Using \\Multi-Discriminator Generative Adversarial Networks

arXiv.org Machine Learning

Anomaly detection is often considered a challenging field of machine learning due to the difficulty of obtaining anomalous samples for training and the need to obtain a sufficient amount of training data. In recent years, autoencoders have been shown to be effective anomaly detectors that train only on "normal" data. Generative adversarial networks (GANs) have been used to generate additional training samples for classifiers, thus making them more accurate and robust. However, in anomaly detection GANs are only used to reconstruct existing samples rather than to generate additional ones. This stems both from the small amount and lack of diversity of anomalous data in most domains. In this study we propose MDGAN, a novel GAN architecture for improving anomaly detection through the generation of additional samples. Our approach uses two discriminators: a dense network for determining whether the generated samples are of sufficient quality (i.e., valid) and an autoencoder that serves as an anomaly detector. MDGAN enables us to reconcile two conflicting goals: 1) generate high-quality samples that can fool the first discriminator, and 2) generate samples that can eventually be effectively reconstructed by the second discriminator, thus improving its performance. Empirical evaluation on a diverse set of datasets demonstrates the merits of our approach.


Systematic Biases in Link Prediction: comparing heuristic and graph embedding based methods

arXiv.org Machine Learning

Link prediction is a popular research topic in network analysis. In the last few years, new techniques based on graph embedding have emerged as a powerful alternative to heuristics. In this article, we study the problem of systematic biases in the prediction, and show that some methods based on graph embedding offer less biased results than those based on heuristics, despite reaching lower scores according to usual quality scores. We discuss the relevance of this finding in the context of the filter bubble problem and the algorithmic fairness of recommender systems.


The Impact of Annotation Guidelines and Annotated Data on Extracting App Features from App Reviews

arXiv.org Machine Learning

Annotation guidelines used to guide the annotation of training and evaluation datasets can have a considerable impact on the quality of machine learning models. In this study, we explore the effects of annotation guidelines on the quality of app feature extraction models. As a main result, we propose several changes to the existing annotation guidelines with a goal of making the extracted app features more useful and informative to the app developers. We test the proposed changes via simulating the application of the new annotation guidelines and then evaluating the performance of the supervised machine learning models trained on datasets annotated with initial and simulated guidelines. While the overall performance of automatic app feature extraction remains the same as compared to the model trained on the dataset with initial annotations, the features extracted by the model trained on the dataset with simulated new annotations are less noisy and more informative to the app developers. Secondly, we are interested in what kind of annotated training data is necessary for training an automatic app feature extraction model. In particular, we explore whether the training set should contain annotated app reviews from those apps/app categories on which the model is subsequently planned to be applied, or is it sufficient to have annotated app reviews from any app available for training, even when these apps are from very different categories compared to the test app. Our experiments show that having annotated training reviews from the test app is not necessary although including them into training set helps to improve recall. Furthermore, we test whether augmenting the training set with annotated product reviews helps to improve the performance of app feature extraction. We find that the models trained on augmented training set lead to improved recall but at the cost of the drop in precision.


Panda: AdaPtive Noisy Data Augmentation for Regularization of Undirected Graphical Models

arXiv.org Machine Learning

We propose PANDA, an AdaPtive Noise Augmentation technique to regularize estimating and constructing undirected graphical models (UGMs). PANDA iteratively solves MLEs given noise augmented data in the regression-based framework until convergence to achieve the designed regularization effects. The augmented noises can be designed to achieve various regularization effects on graph estimation, including the bridge, elastic net, adaptive lasso, and SCAD penalization; it can also offer group lasso and fused ridge when some nodes belong to the same group. We establish theoretically that the noise-augmented loss functions and its minimizer converge almost surely to the expected penalized loss function and its minimizer, respectively. We derive the asymptotic distributions for the regularized regression coefficients through PANDA in GLMs, based on which, the inferences for the parameters can be obtained simultaneously with variable selection. Our empirical results suggest the inferences achieve nominal or near-nominal coverage and are far more efficient compared to some existing post-selection procedures. On the algorithm level, PANDA can be easily programmed in any standard software without resorting to complicated optimization techniques. We show the non-inferior performance of PANDA in constructing graphs of different types in simulation studies and also apply PANDA to the autism spectrum disorder data to construct a mixed-node graph.



ET-Lasso: Efficient Tuning of Lasso for High-Dimensional Data

arXiv.org Machine Learning

The L1 regularization (Lasso) has proven to be a versatile tool to select relevant features and estimate the model coefficients simultaneously. Despite its popularity, it is very challenging to guarantee the feature selection consistency of Lasso. One way to improve the feature selection consistency is to select an ideal tuning parameter. Traditional tuning criteria mainly focus on minimizing the estimated prediction error or maximizing the posterior model probability, such as cross-validation and BIC, which may either be time-consuming or fail to control the false discovery rate (FDR) when the number of features is extremely large. The other way is to introduce pseudo-features to learn the importance of the original ones. Recently, the Knockoff filter is proposed to control the FDR when performing feature selection. However, its performance is sensitive to the choice of the expected FDR threshold. Motivated by these ideas, we propose a new method using pseudo-features to obtain an ideal tuning parameter. In particular, we present the Efficient Tuning of Lasso (ET-Lasso) to separate active and inactive features by adding permuted features as pseudo-features in linear models. The pseudo-features are constructed to be inactive by nature, which can be used to obtain a cutoff to select the tuning parameter that separates active and inactive features. Experimental studies on both simulations and real-world data applications are provided to show that ET-Lasso can effectively and efficiently select active features under a wide range of different scenarios.