Performance Analysis
Concept Drift Detection and Adaptation with Hierarchical Hypothesis Testing
Yu, Shujian, Abraham, Zubin, Wang, Heng, Shah, Mohak, Príncipe, José C.
Analyzing and detecting changes in streaming data effectively, especially in the era of big data, poses new challenges to the machine learning and statistics communities [1], [2]. As a result, early approaches for detecting statistical changes in a time series (such as change point detection) have had to be extended for online detection of changes in multivariate data streams [3], [4]. Some of these techniques for detecting the intrinsic change in the relationship of the incoming data streams have been applied to numerous real-world applications, such as fraud detection, user preference prediction and email filtering [5], [6]. Online classification is another common task performed on streaming multivariate time series data that takes advantage of these statistical relationships to predict a class label at each time index [7]. If the underlying source generating the data is not stationary, the optimal decision rule for the classifier would change over time - a phenomenon known as concept drift [8]. Given the impact of concept drift on the predictive performance of an online classifier, there is a need to detect these concept drifts as early as possible. The inability of change point detection approaches to detect these concept drifts has motivated the need for concept drift detection approaches that monitor not only the joint distribution of a multivariate data stream but also changes in its relationship to the class labels of the streaming data. Shujian Yu and José C. Príncipe are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA.
NeuralFDR: Learning Discovery Thresholds from Hypothesis Features
Xia, Fei, Zhang, Martin J., Zou, James, Tse, David
As datasets grow richer, an important challenge is to leverage the full features in the data to maximize the number of useful discoveries while controlling for false positives. We address this problem in the context of multiple hypothesis testing, where for each hypothesis, we observe a p-value along with a set of features specific to that hypothesis. For example, in genetic association studies, each hypothesis tests the correlation between a variant and the trait. We have a rich set of features for each variant (e.g., its location, conservation, and epigenetics) which could inform how likely the variant is to have a true association. However, popular testing approaches, such as Benjamini-Hochberg's procedure (BH) and independent hypothesis weighting (IHW), either ignore these features or assume that the features are categorical or univariate. We propose a new algorithm, NeuralFDR, which automatically learns a discovery threshold as a function of all the hypothesis features. We parametrize the discovery threshold as a neural network, which enables flexible handling of multi-dimensional discrete and continuous features as well as efficient end-to-end optimization. We prove that NeuralFDR has strong false discovery rate (FDR) guarantees, and show that it makes substantially more discoveries in synthetic and real datasets. Moreover, we demonstrate that the learned discovery threshold is directly interpretable.
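For context, the feature-agnostic baseline mentioned in this abstract, the Benjamini-Hochberg step-up procedure, can be sketched in a few lines. This is a minimal illustration of the standard BH rule only, not of NeuralFDR; the function name `benjamini_hochberg` and the default `alpha` are assumptions made for the example.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return a boolean mask marking the hypotheses rejected by BH.

    Illustrative sketch: the name and default alpha are assumptions,
    not taken from the NeuralFDR paper.
    """
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)          # indices that sort p ascending
    ranked = p[order]
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / n.
    below = ranked <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        # Reject every hypothesis ranked at or below k, even if some of
        # them individually exceed their own threshold.
        reject[order[:k + 1]] = True
    return reject
```

Note that the single threshold here is the same for every hypothesis; the abstract's point is that NeuralFDR instead lets the threshold depend on the hypothesis features.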
Prediction Scores as a Window into Classifier Behavior
Katehara, Medha, Beauxis-Aussalet, Emma, Alsallakh, Bilal
Most multi-class classifiers make their prediction for a test sample by scoring the classes and selecting the one with the highest score. Analyzing these prediction scores is useful to understand the classifier behavior and to assess its reliability. We present an interactive visualization that facilitates per-class analysis of these scores. Our system, called Classilist, enables relating these scores to the classification correctness and to the underlying samples and their features. We illustrate how such analysis reveals varying behavior of different classifiers.
Data Analytics for Internal Audit Data Mining Blog - www.dataminingblog.com
This is a guest post from Marcel Baumgartner, Data Analytics Expert at Nestlé S.A. Large publicly listed companies not only have external auditors who check the books, but often also a large community of internal auditors. These collaborators provide the company with a sufficient level of assurance in terms of adherence to internal and external rules and guidelines. This covers financial aspects (spend, invoices, investments, …), human resources (working time, payroll, …) but also production related aspects (e.g. One of the strongest trends observed in internal auditing communities is the increasingly widespread use of Data Analytics. The term refers to the use of data, statistical methods and statistical thinking as a way of working, in addition to traditional auditing methods like interviews, document and process reviews, etc.
WWE Survivor Series 2017: Predictions, Match Card For Raw vs. SmackDown PPV
WWE Survivor Series 2017 has quickly turned into a "must-see" pay-per-view. Below are predictions for every match on the WWE Survivor Series card. This could end up being a fun match, but make no mistake, Lesnar is going to win at Survivor Series. The man who defeated Braun Strowman clean less than two months ago isn't going to lose to the much smaller Styles. Don't be surprised if the blue brand's top champion doesn't get much offense in at all before being pinned.
Predictive Independence Testing, Predictive Conditional Independence Testing, and Predictive Graphical Modelling
Burkart, Samuel, Király, Franz J
Testing (conditional) independence of multivariate random variables is a task central to statistical inference and modelling in general - though unfortunately one for which to date there does not exist a practicable workflow. State-of-the-art workflows suffer from the need for heuristic or subjective manual choices, high computational complexity, or strong parametric assumptions. We address these problems by establishing a theoretical link between multivariate/conditional independence testing and model comparison in the multivariate predictive modelling (aka supervised learning) task. This link allows advances in the extensively studied supervised learning workflow to be directly transferred to independence testing workflows - including automated tuning of machine learning type, which addresses the need for a heuristic choice, the ability to quantitatively trade off computational demand with accuracy, and the modern black-box philosophy for checking and interfacing. As a practical implementation of this link between the two workflows, we present a Python package 'pcit', which implements our novel multivariate and conditional independence tests, interfacing the supervised learning API of the scikit-learn package. Theory and package also allow for straightforward independence test based learning of graphical model structure. We empirically show that our proposed predictive independence tests outperform or are on par with current practice, and that the derived graphical model structure learning algorithms asymptotically recover the 'true' graph. This paper, and the 'pcit' package accompanying it, thus provide powerful, scalable, generalizable, and easy-to-use methods for multivariate and conditional independence testing, as well as for graphical model structure learning.
Accelerating Cross-Validation in Multinomial Logistic Regression with $\ell_1$-Regularization
Obuchi, Tomoyuki, Kabashima, Yoshiyuki
We develop an approximate formula for evaluating a cross-validation estimator of predictive likelihood for multinomial logistic regression regularized by an $\ell_1$-norm. This allows us to avoid repeated optimizations required for literally conducting cross-validation; hence, the computational time can be significantly reduced. The formula is derived through a perturbative approach employing the largeness of the data size and the model dimensionality. Its usefulness is demonstrated on simulated data and the ISOLET dataset from the UCI machine learning repository.
Introducing DeepBalance: Random Deep Belief Network Ensembles to Address Class Imbalance
When solving practical classification problems, a practitioner may face class imbalance, meaning that one class (called the majority class) has a significantly higher prevalence than the others. Examples of imbalanced classification problems in the literature include [1], [2], [3], [4]. Class imbalance problems may be exacerbated in the future as we discover new methods to collect rare data and as the rate of data collection increases. In many class imbalance problems, the minority class is not only the class of interest but also carries the higher misclassification cost, which complicates learning [5]. Machine learning classifiers try to find an optimal decision boundary that fits the training data. As classifiers generally seek the simplest rule that partitions the training data, the simplest rule in imbalanced settings is often to always predict the majority class [6]. Results can be deceptive for such classifiers, as they may achieve high accuracy. For example, in a problem where a minority class occurs 0.1% of the time, an uninformed classifier can achieve 99.9% accuracy by simply always predicting observations as the majority. Thus, the naturally occurring target class distribution is not optimal for learning in highly imbalanced scenarios [7], [8], [9], [10].
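The 0.1% example above can be reproduced with a small sketch. The data are synthetic (1 minority sample among 1,000), and the helper names `accuracy` and `recall` are chosen for illustration; they are not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Synthetic data: the minority class (label 1) occurs 0.1% of the time.
y_true = [1] * 1 + [0] * 999
# An uninformed classifier that always predicts the majority class (label 0).
y_pred = [0] * 1000

majority_accuracy = accuracy(y_true, y_pred)   # 999 of 1000 correct
minority_recall = recall(y_true, y_pred)       # no minority sample is found
```

Accuracy alone hides the fact that the minority class is never detected, which is why per-class measures such as recall are usually reported alongside it in imbalanced settings.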
Wald-Kernel: Learning to Aggregate Information for Sequential Inference
Sequential hypothesis testing is a desirable decision-making strategy in any time-sensitive scenario. Compared with fixed sample-size testing, sequential testing is capable of achieving identical probability of error requirements using fewer samples on average. For a binary detection problem, it is well known that, when the density functions are known, accumulating the likelihood ratio statistics is time-optimal under a fixed error rate constraint. This paper considers the problem of learning a binary sequential detector from training samples when density functions are unavailable. We formulate the problem as a constrained likelihood ratio estimation which can be solved efficiently through convex optimization by imposing Reproducing Kernel Hilbert Space (RKHS) structure on the log-likelihood ratio function. In addition, we provide a computationally efficient approximate solution for large-scale data sets. The proposed algorithm, namely Wald-Kernel, is tested on a synthetic data set and two real-world data sets, together with previous approaches for likelihood ratio estimation. Our empirical results show that the classifier trained through the proposed technique achieves smaller average sampling cost than previous approaches proposed in the literature for the same error rate.
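The known-density case referred to above is the classical sequential probability ratio test (SPRT). The sketch below is not the Wald-Kernel algorithm; it only illustrates accumulating log-likelihood ratios until a threshold is crossed. The function names, the Gaussian example, and the use of Wald's approximate thresholds are assumptions made for illustration.

```python
import math

def sprt(samples, log_lr, alpha=0.05, beta=0.05):
    """Classical SPRT with Wald's approximate thresholds.

    alpha: target false-alarm probability, beta: target miss probability.
    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1 - beta) / alpha)   # decide H1 at or above this
    lower = math.log(beta / (1 - alpha))   # decide H0 at or below this
    total = 0.0
    n = 0
    for x in samples:
        n += 1
        total += log_lr(x)                 # accumulate the evidence
        if total >= upper:
            return "H1", n
        if total <= lower:
            return "H0", n
    return "undecided", n                  # ran out of samples

def gaussian_log_lr(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2)."""
    return (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
```

With these defaults, each observation equal to 1.0 contributes 0.5 to the running sum, so `sprt([1.0] * 20, gaussian_log_lr)` stops once the sum reaches log(19). In the setting of the paper the densities are unavailable, so `log_lr` would have to be learned from data instead of written down in closed form.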
LIUBoost : Locality Informed Underboosting for Imbalanced Data Classification
Ahmed, Sajid, Rayhan, Farshid, Mahbub, Asif, Jani, Md. Rafsan, Shatabda, Swakkhar, Farid, Dewan Md., Rahman, Chowdhury Mofizur
The problem of class imbalance along with class overlap has become a major issue in the domain of supervised learning. Most supervised learning algorithms assume equal cardinality of the classes under consideration while optimizing the cost function; this assumption does not hold for imbalanced datasets, which results in sub-optimal classification. Therefore, various approaches, such as undersampling, oversampling, cost-sensitive learning and ensemble-based methods, have been proposed for dealing with imbalanced datasets. However, undersampling suffers from information loss, oversampling suffers from increased runtime and potential overfitting, and cost-sensitive methods suffer from inadequately defined cost assignment schemes. In this paper, we propose a novel boosting-based method called LIUBoost. Like RUSBoost, LIUBoost uses undersampling to balance the datasets in every boosting iteration, while incorporating a cost term for every instance, based on its hardness, into the weight update formula, minimizing the information loss introduced by undersampling. LIUBoost has been extensively evaluated on 18 imbalanced datasets, and the results indicate significant improvement over RUSBoost, the best-performing existing method.
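The balancing step that the abstract attributes to both RUSBoost and LIUBoost, random undersampling, can be sketched as follows. This shows only plain random undersampling, not the boosting loop or LIUBoost's hardness-based cost term; the function name and the choice to shrink every class to the size of the smallest one are assumptions made for illustration.

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the rarest one.

    Illustrative sketch: equal class sizes are an assumption here; other
    target ratios are possible.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    # Keep a random subset of n_min indices from each class.
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]
```

The discarded majority samples are exactly the information loss that the abstract mentions as the weakness of undersampling.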