 Accuracy


NeuralFDR: Learning Discovery Thresholds from Hypothesis Features

arXiv.org Machine Learning

As datasets grow richer, an important challenge is to leverage the full features in the data to maximize the number of useful discoveries while controlling for false positives. We address this problem in the context of multiple hypothesis testing, where for each hypothesis, we observe a p-value along with a set of features specific to that hypothesis. For example, in genetic association studies, each hypothesis tests the correlation between a variant and the trait. We have a rich set of features for each variant (e.g. its location, conservation, epigenetics, etc.) which could inform how likely the variant is to have a true association. However, popular testing approaches, such as Benjamini-Hochberg's procedure (BH) and independent hypothesis weighting (IHW), either ignore these features or assume that the features are categorical or univariate. We propose a new algorithm, NeuralFDR, which automatically learns a discovery threshold as a function of all the hypothesis features. We parametrize the discovery threshold as a neural network, which enables flexible handling of multi-dimensional discrete and continuous features as well as efficient end-to-end optimization. We prove that NeuralFDR has strong false discovery rate (FDR) guarantees, and show that it makes substantially more discoveries in synthetic and real datasets. Moreover, we demonstrate that the learned discovery threshold is directly interpretable.
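
For context, the Benjamini-Hochberg baseline named in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard step-up procedure only, not of NeuralFDR itself; the function name `benjamini_hochberg` and the example p-values are assumptions made for this sketch. It applies a single threshold to all hypotheses and ignores their features, which is the limitation the abstract describes.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the BH step-up procedure rejects.

    Hypothetical helper for illustration; uses only the p-values and
    ignores any hypothesis features.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Illustrative p-values (made up for this sketch).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

With these inputs only the two smallest p-values fall under their rank-scaled thresholds, so two hypotheses are rejected.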


Prediction Scores as a Window into Classifier Behavior

arXiv.org Machine Learning

Most multi-class classifiers make their prediction for a test sample by scoring the classes and selecting the one with the highest score. Analyzing these prediction scores is useful to understand the classifier behavior and to assess its reliability. We present an interactive visualization that facilitates per-class analysis of these scores. Our system, called Classilist, enables relating these scores to the classification correctness and to the underlying samples and their features. We illustrate how such analysis reveals varying behavior of different classifiers.
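
The prediction rule described above (score every class, pick the highest) and a per-class grouping of the winning scores can be sketched as follows. This is not Classilist's implementation; the helper names `predict` and `scores_by_predicted_class`, the data layout, and the sample scores are all assumptions chosen for illustration.

```python
from collections import defaultdict


def predict(scores):
    """Pick the class with the highest score from a {label: score} dict."""
    return max(scores, key=scores.get)


def scores_by_predicted_class(samples):
    """Group (winning score, was the prediction correct?) by predicted class.

    `samples` is a list of (scores, true_label) pairs; this layout is an
    assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for scores, true_label in samples:
        label = predict(scores)
        grouped[label].append((scores[label], label == true_label))
    return dict(grouped)


# Made-up scores for three test samples.
samples = [
    ({"cat": 0.7, "dog": 0.2, "bird": 0.1}, "cat"),
    ({"cat": 0.5, "dog": 0.4, "bird": 0.1}, "dog"),
    ({"cat": 0.1, "dog": 0.8, "bird": 0.1}, "dog"),
]
grouped = scores_by_predicted_class(samples)
```

Grouping this way makes it easy to see, per class, whether low winning scores coincide with misclassifications.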


Data Analytics for Internal Audit Data Mining Blog - www.dataminingblog.com

@machinelearnbot

This is a guest post from Marcel Baumgartner, Data Analytics Expert at Nestlé S.A. Large publicly listed companies not only have external auditors who check the books, but often also a large community of internal auditors. These collaborators provide the company with a sufficient level of assurance in terms of adherence to internal and external rules and guidelines. This covers financial aspects (spend, invoices, investments, …), human resources (working time, payroll, …) but also production related aspects (e.g. …). One of the strongest trends observed in internal auditing communities is the more and more widespread use of Data Analytics. The term refers to the use of data, statistical methods and statistical thinking as a way of working, in addition to traditional auditing methods like interviews, document and process reviews, etc.


WWE Survivor Series 2017: Predictions, Match Card For Raw vs. SmackDown PPV

International Business Times

WWE Survivor Series 2017 has quickly turned into a "must-see" pay-per-view. Below are predictions for every match on the WWE Survivor Series card. This could end up being a fun match, but make no mistake, Lesnar is going to win at Survivor Series. The man that defeated Braun Strowman clean less than two months ago isn't going to lose to the much smaller Styles. Don't be surprised if the blue brand's top champion doesn't get much offense in at all before being pinned.


Predictive Independence Testing, Predictive Conditional Independence Testing, and Predictive Graphical Modelling

arXiv.org Machine Learning

Testing (conditional) independence of multivariate random variables is a task central to statistical inference and modelling in general - though unfortunately one for which to date there does not exist a practicable workflow. State-of-the-art workflows suffer from the need for heuristic or subjective manual choices, high computational complexity, or strong parametric assumptions. We address these problems by establishing a theoretical link between multivariate/conditional independence testing and model comparison in the multivariate predictive modelling (aka supervised learning) task. This link allows advances in the extensively studied supervised learning workflow to be directly transferred to independence testing workflows - including automated tuning of machine learning type, which addresses the need for a heuristic choice, the ability to quantitatively trade off computational demand with accuracy, and the modern black-box philosophy for checking and interfacing. As a practical implementation of this link between the two workflows, we present a Python package 'pcit', which implements our novel multivariate and conditional independence tests, interfacing the supervised learning API of the scikit-learn package. Theory and package also allow for straightforward independence test based learning of graphical model structure. We empirically show that our proposed predictive independence tests outperform or are on par with current practice, and that the derived graphical model structure learning algorithms asymptotically recover the 'true' graph. This paper, and the 'pcit' package accompanying it, thus provide powerful, scalable, generalizable, and easy-to-use methods for multivariate and conditional independence testing, as well as for graphical model structure learning.


Introducing DeepBalance: Random Deep Belief Network Ensembles to Address Class Imbalance

arXiv.org Machine Learning

When solving practical classification problems, a practitioner may be faced with class imbalance, meaning that one class (called the majority class) has a significantly higher prevalence than the others. Examples of imbalanced classification problems in the literature include [1], [2], [3], [4]. Class imbalance problems may be exacerbated in the future as we discover new methods to collect rare data and the rate of data collection increases. In many class imbalance problems, the minority class is not only the class of interest, but also carries the higher misclassification cost, which complicates learning [5]. Machine learning classifiers try to find an optimal decision boundary that fits training data. As classifiers generally seek to find the simplest rule that partitions the training data, the simplest rule in imbalanced settings is often to always predict the majority class [6]. Results can be deceptive for such classifiers, as they may achieve high accuracy. For example, in a problem where a minority class occurs 0.1% of the time, an uninformed classifier can achieve 99.9% accuracy by simply always predicting the majority class. Thus, the naturally occurring target class distribution is not optimal for learning in highly imbalanced scenarios [7], [8], [9], [10].
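
The 0.1% example above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dataset of 1,000 labels and the helper names `accuracy` and `recall` are assumptions, and the "classifier" is simply a constant prediction of the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        return 0.0
    return sum(1 for p in positives if p == positive) / len(positives)


# Hypothetical dataset: 1 minority sample (label 1) among 1,000, i.e. 0.1%.
y_true = [1] * 1 + [0] * 999
# Uninformed classifier: always predict the majority class (label 0).
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
```

Accuracy comes out at 99.9% while recall on the minority class is zero, which is why accuracy alone is a misleading metric in this setting.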


Global Bigdata Conference

#artificialintelligence

Cybercrime is on the rise, and organizations across a wide variety of industries -- from financial institutions to insurance, health care providers, and large e-retailers -- are rightfully worried. In the first half of 2017 alone, over 2 billion records were compromised. After stealing PII (personally identifiable information) from these hacks, fraudsters can gain access to customer accounts, create synthetic identities, and even craft phony business profiles to commit various forms of fraud. Naturally, companies are frantically looking to beef up their security teams. A large skills gap is causing hiring difficulties in the cybersecurity industry, so much so that the Information Systems Audit and Control Association found that less than one in four candidates who apply for cybersecurity jobs are qualified.


LIUBoost : Locality Informed Underboosting for Imbalanced Data Classification

arXiv.org Machine Learning

The problem of class imbalance along with class overlap has become a major issue in the domain of supervised learning. Most supervised learning algorithms assume equal cardinality of the classes under consideration while optimizing the cost function; this assumption does not hold for imbalanced datasets, which results in sub-optimal classification. Therefore, various approaches, such as undersampling, oversampling, cost-sensitive learning and ensemble-based methods, have been proposed for dealing with imbalanced datasets. However, undersampling suffers from information loss, oversampling suffers from increased runtime and potential overfitting, while cost-sensitive methods suffer from inadequately defined cost assignment schemes. In this paper, we propose a novel boosting-based method called LIUBoost. Like RUSBoost, LIUBoost uses undersampling to balance the dataset in every boosting iteration, while incorporating a cost term for every instance, based on its hardness, into the weight update formula, minimizing the information loss introduced by undersampling. LIUBoost has been extensively evaluated on 18 imbalanced datasets, and the results indicate significant improvement over the existing best-performing method, RUSBoost.
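
The undersampling step mentioned above can be sketched as plain random undersampling: every class is cut down to the size of the smallest one. This is not LIUBoost's algorithm and omits its cost term and weight update; the function name `random_undersample`, the seed parameter, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class has as many as the smallest class.

    Hypothetical helper; returns the kept samples in their original order.
    """
    rng = random.Random(seed)
    # Collect the sample indices belonging to each class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    # Draw n_min indices without replacement from every class.
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 9 majority samples (label 0) and 3 minority samples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3
X_bal, y_bal = random_undersample(X, y, seed=0)
class_counts = Counter(y_bal)
```

The six discarded majority samples are the information loss that the abstract attributes to undersampling.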


pyLEMMINGS: Large Margin Multiple Instance Classification and Ranking for Bioinformatics Applications

arXiv.org Machine Learning

Motivation: A major challenge in the development of machine learning based methods in computational biology is that data may not be accurately labeled due to the time and resources required for experimentally annotating properties of proteins and DNA sequences. Standard supervised learning algorithms assume accurate instance-level labeling of training data. Multiple instance learning is a paradigm for handling such labeling ambiguities. However, the widely used large-margin classification methods for multiple instance learning are heuristic in nature and have high computational requirements. In this paper, we present stochastic sub-gradient optimization large margin algorithms for multiple instance classification and ranking, and provide them in a software suite called pyLEMMINGS. Results: We have tested pyLEMMINGS on a number of bioinformatics problems as well as benchmark datasets. pyLEMMINGS successfully identified functionally important segments of proteins: binding sites in Calmodulin binding proteins, prion forming regions, and amyloid cores. pyLEMMINGS achieves state-of-the-art performance in all these tasks, demonstrating the value of multiple instance learning. Furthermore, our method has shown more than 100-fold improvement in running time compared to heuristic solutions, with improved accuracy over benchmark datasets. Availability and Implementation: The pyLEMMINGS Python package is available for download at: http://faculty.pieas.edu.pk/fayyaz/software.html#pylemmings.


Calibrated Boosting-Forest

arXiv.org Machine Learning

Excellent ranking power along with well-calibrated probability estimates are needed in many classification tasks. In this paper, we introduce a technique, Calibrated Boosting-Forest, that captures both. This novel technique is an ensemble of gradient boosting machines that can support both continuous and binary labels. While offering superior ranking power over any individual regression or classification model, Calibrated Boosting-Forest is able to preserve well-calibrated posterior probabilities. Along with these benefits, we provide an alternative to the tedious step of tuning gradient boosting machines. We demonstrate that tuning Calibrated Boosting-Forest can be reduced to a simple hyper-parameter selection. We further establish that increasing this hyper-parameter improves the ranking performance with diminishing returns. We examine the effectiveness of Calibrated Boosting-Forest on ligand-based virtual screening, where both continuous and binary labels are available, and compare the performance of Calibrated Boosting-Forest with logistic regression, gradient boosting machine and deep learning. Calibrated Boosting-Forest achieved an approximately 48% improvement compared to a state-of-the-art deep learning model. Moreover, it achieved around 95% improvement on probability quality measurement compared to the best individual gradient boosting machine. Calibrated Boosting-Forest offers a benchmark demonstration that, in the field of ligand-based virtual screening, deep learning is not the universally dominant machine learning model and well-calibrated probabilities can better facilitate the virtual screening process.