Accuracy
Appropriateness of Performance Indices for Imbalanced Data Classification: An Analysis
Mullick, Sankha Subhra, Datta, Shounak, Dhekane, Sourish Gunesh, Das, Swagatam
Indices quantifying the performance of classifiers under class imbalance often suffer from distortions depending on the constitution of the test set or the class-specific classification accuracy, creating difficulties in assessing the merit of the classifier. We identify two fundamental conditions that a performance index must satisfy to be resilient, respectively, to altering the number of testing instances from each class and the number of classes in the test set. In light of these conditions, under the effect of class imbalance, we theoretically analyze four indices commonly used for evaluating binary classifiers and five popular indices for multi-class classifiers. For indices violating any of the conditions, we also suggest remedial modification and normalization. We further investigate the capability of the indices to retain information about the classification performance over all the classes, even when the classifier exhibits extreme performance on some classes. Simulation studies are performed on high-dimensional deep representations of a subset of the ImageNet dataset using four state-of-the-art classifiers tailored for handling class imbalance. Finally, based on our theoretical findings and empirical evidence, we recommend the appropriate indices that should be used to evaluate the performance of classifiers in the presence of class imbalance.
Improving Fairness in Criminal Justice Algorithmic Risk Assessments Using Conformal Prediction Sets
Berk, Richard A., Kuchibhotla, Arun Kumar
Risk assessment algorithms have been correctly criticized for potential unfairness, and there is an active cottage industry trying to make repairs. In this paper, we adopt a framework from conformal prediction sets to remove unfairness from risk algorithms themselves and the covariates used for forecasting. From a sample of 300,000 offenders at their arraignments, we construct a confusion table and its derived measures of fairness that are effectively free of any meaningful differences between Black and White offenders. We also produce fair forecasts for individual offenders coupled with valid probability guarantees that the forecasted outcome is the true outcome. We see our work as a demonstration of concept for application in a wide variety of criminal justice decisions. The procedures provided can be routinely implemented in jurisdictions with the usual criminal justice datasets used by administrators. The requisite procedures can be found in the scripting software R. However, whether stakeholders will accept our approach as a means to achieve risk assessment fairness is unknown. There are also legal issues that would need to be resolved, although we offer a Pareto improvement.
Plotting a Confusion Matrix- Machine Learning in Python
In this blog post, I will explain how to plot confusion matrices in Python. This is my second blog post on the confusion matrix. If you want to understand what a confusion matrix is and how to get insights from it, check out my first blog post. I have attached the link below. Now, without further ado, let's dive into how to plot a confusion matrix.
CareCall: a Call-Based Active Monitoring Dialog Agent for Managing COVID-19 Pandemic
Lee, Sang-Woo, Jung, Hyunhoon, Ko, SukHyun, Kim, Sunyoung, Kim, Hyewon, Doh, Kyoungtae, Park, Hyunjung, Yeo, Joseph, Ok, Sang-Houn, Lee, Joonhaeng, Lim, Sungsoon, Jeong, Minyoung, Choi, Seongjae, Hwang, SeungTae, Park, Eun-Young, Ma, Gwang-Ja, Han, Seok-Joo, Cha, Kwang-Seung, Sung, Nako, Ha, Jung-Woo
Tracking suspected cases of COVID-19 is crucial to suppressing the spread of the COVID-19 pandemic. Active monitoring and proactive inspection are indispensable for mitigating the spread of COVID-19, though these require considerable social and economic expense. To address this issue, we introduce CareCall, a call-based dialog agent which is deployed for active monitoring in Korea and Japan. We describe our system with a case study with statistics to show how the system works. Finally, we discuss a simple idea which uses CareCall to support proactive inspection.
Context-Dependent Implicit Authentication for Wearable Device User
Cheung, William, Vhaduri, Sudip
As market wearables become popular for a range of services, such as making financial transactions and accessing cars, that rely on a user's private information, the security of this information is becoming very important. However, users are often flooded with PINs and passwords in this internet of things (IoT) world. Additionally, hard-biometric-based authentication, such as facial or finger recognition, is not adaptable to market wearables due to their limited sensing and computation capabilities. Therefore, there is a timely need to develop a burden-free implicit authentication mechanism for wearables using the less-informative soft-biometric data that are easily obtainable from market wearables. In this work, we present a context-dependent soft-biometric-based wearable authentication system utilizing heart rate, gait, and breathing audio signals. From our detailed analysis, we find that a binary support vector machine (SVM) with a radial basis function (RBF) kernel can achieve an average accuracy of $0.94 \pm 0.07$, an $F_1$ score of $0.93 \pm 0.08$, and an equal error rate (EER) of about $0.06$ at a lower confidence threshold of 0.52, which shows the promise of this work.
SOAR: Simultaneous Or of And Rules for Classification of Positive & Negative Classes
Khusainova, Elena, Dodwell, Emily, Mitra, Ritwik
Algorithmic decision making has proliferated and now impacts our daily lives in both mundane and consequential ways. Machine learning practitioners make use of a myriad of algorithms for predictive models in applications as diverse as movie recommendations, medical diagnoses, and parole recommendations without delving into the reasons driving specific predictive decisions. Machine learning algorithms in such applications are often chosen for their superior performance; however, popular choices such as random forests and deep neural networks fail to provide an interpretable understanding of the predictive model. In recent years, rule-based algorithms have been used to address this issue. Wang et al. (2017) presented an or-of-and (disjunctive normal form) based classification technique that allows for classification rule mining of a single class in a binary classification; this method is also shown to perform comparably to other modern algorithms. In this work, we extend this idea to provide classification rules for both classes simultaneously. That is, we provide a distinct set of rules for each of the positive and negative classes. In describing this approach, we also present a novel and complete taxonomy of classifications that clearly captures and quantifies the inherent ambiguity in noisy binary classifications in the real world. We show that this approach leads to a more granular formulation of the likelihood model and that a simulated-annealing-based optimization achieves classification performance competitive with comparable techniques. We apply our method to synthetic as well as real-world data sets and compare it with other related methods to demonstrate the utility of our proposal.
NFL has 77 apparently false positive COVID-19 tests from lab
NEW YORK (AP) -- The NFL had 77 positive COVID-19 tests from 11 teams re-examined by a New Jersey lab after false positives, and all those tests came back negative. The league asked the New Jersey lab BioReference to investigate the results, and those 77 tests are being re-tested once more to make sure they were false positives. Among teams reporting false positives, the Minnesota Vikings said they had 12, the New York Jets 10 and the Chicago Bears nine.
Variable selection for Gaussian process regression through a sparse projection
Park, Chiwoo, Borth, David J., Wilson, Nicholas S., Hunter, Chad N.
This paper presents a new variable selection approach integrated with Gaussian process (GP) regression. We consider a sparse projection of input variables and a general stationary covariance model that depends on the Euclidean distance between the projected features. The sparse projection matrix is treated as an unknown parameter. We propose a forward stagewise approach with embedded gradient descent steps to co-optimize this parameter with the other covariance parameters by maximizing a non-convex marginal likelihood function with a concave sparsity penalty, and we provide some convergence properties of the algorithm. The proposed model covers a broader class of stationary covariance functions than the existing automatic relevance determination approaches, and the solution approach is more computationally feasible than the existing MCMC sampling procedures for automatic relevance parameter estimation with a sparsity prior. The approach is evaluated on a large number of simulated scenarios. The choice of tuning parameters and the accuracy of the parameter estimation are evaluated in the simulation study. In comparison with some chosen benchmark approaches, the proposed approach provided better accuracy in variable selection. It is applied to the important problem of identifying environmental factors that affect the atmospheric corrosion of metal alloys.
Towards Stable Imbalanced Data Classification via Virtual Big Data Projection
Mansourifar, Hadi, Shi, Weidong
Virtual Big Data (VBD) was very recently shown to be effective in alleviating mode collapse and vanishing generator gradients, two major problems of Generative Adversarial Neural Networks (GANs). In this paper, we investigate the capability of VBD to address two other major challenges in machine learning: deep autoencoder training and imbalanced data classification. First, we prove that VBD can significantly decrease the validation loss of autoencoders by providing them with a huge, diversified training set, which is key to reaching better generalization and minimizing the over-fitting problem. Second, we use VBD to propose the first projection-based method, called cross-concatenation, to balance skewed class distributions without over-sampling. We prove that cross-concatenation can solve the uncertainty problem of data-driven methods for imbalanced classification.
Upsampling Minority Classes in Imbalanced Text Classification Problems Using Markov Chains
Classification problems in supervised machine learning are often troubled by the issue of imbalanced class sizes. Given binary classified data, an imbalanced split between the two classes will bias the predictions of a model fit to it. A model trained on data made up of 1,000 samples labeled class "0" and 100 samples labeled class "1" could naively predict class "0" for every test instance and, on a test set with the same class ratio, report roughly 91% accuracy (1,000 of 1,100). Such an accuracy score is deceptive, as the model is not actually "learning" any trends from the data. This can cause serious problems in deployment.
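The always-predict-the-majority scenario can be checked with a few lines of plain Python. This is only an illustrative sketch, not code from the post: the helper names (`majority_baseline`, `accuracy`, `balanced_accuracy`) are hypothetical, and it assumes the classifier is scored on data with the same 1,000-to-100 class ratio it was trained on. Balanced accuracy (the mean of per-class recall) is included as one common index that exposes the problem.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label; a 'model' that ignores its input."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of instances whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Assumed setup: 1,000 samples of class 0 and 100 samples of class 1,
# evaluated on data with the same class ratio.
labels = [0] * 1000 + [1] * 100
guess = majority_baseline(labels)
preds = [guess] * len(labels)

acc = accuracy(labels, preds)
bal = balanced_accuracy(labels, preds)
print(f"accuracy={acc:.3f} balanced_accuracy={bal:.3f}")
```

Plain accuracy comes out at 1000/1100, about 0.909, while balanced accuracy is 0.5, the same as random guessing on two classes, because the recall on class "1" is zero.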