Judging a classification model feels like it should be easier than judging a regression model. After all, a prediction from a classification model can only be right or wrong, while a prediction from a regression model can be more or less wrong, with any magnitude of error. Yet judging a classification is not as simple as it may seem. There is more than one way for a classification to be right or wrong, and multiple ways to combine those kinds of rightness and wrongness into a single metric. Of course, these metrics come with different, frequently unintuitive names (precision, recall, F1, ROC curves), making the process seem a little forbidding from the outside.
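As a quick, concrete anchor for those names, here is a minimal sketch (with made-up labels, using scikit-learn): the two ways to be right and two ways to be wrong in a binary confusion matrix, and three metrics that combine them differently.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for a binary problem: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Two ways to be right (TP, TN) and two ways to be wrong (FP, FN).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

# Metrics that combine the four cells in different ways:
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```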
The ROC (receiver operating characteristic) curve and AUC (area under the curve) are performance measures that provide a comprehensive evaluation of classification models. AUC is the area under the ROC curve: it condenses the curve into a single number between 0 and 1 that indicates how well a binary classifier separates the positive and negative classes. Before going into detail, let's first explain the confusion matrix and how different threshold values change its contents. A confusion matrix is not itself an evaluation metric, but it provides insight into a model's predictions.
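To make the threshold dependence concrete, here is a minimal sketch with made-up probability scores: each threshold produces a different confusion matrix from the same predictions, while AUC summarizes the model's ranking ability across all thresholds in one number.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.4, 0.35, 0.8, 0.55, 0.9, 0.6, 0.2, 0.7])

# The same scores yield a different confusion matrix as the threshold moves.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")

# AUC summarizes performance across all thresholds as a value in [0, 1].
print("AUC:", roc_auc_score(y_true, y_prob))
```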
This study is motivated by the magnitude of Louisiana's high school dropout problem and its negative impacts on individual and public wellbeing. Our goal is to predict students who are at risk of dropping out of high school by examining a Louisiana administrative dataset. Due to the imbalanced nature of the dataset, imbalanced learning techniques including resampling, case weighting, and cost-sensitive learning were applied to enhance prediction performance on the rare class. The performance metrics used in this study are the F-measure, recall, and precision of the rare class. We compare the performance of several machine learning algorithms, such as neural networks, decision trees, and bagged trees, in combination with the imbalanced learning approaches, using an administrative dataset of 366k records from the Louisiana Department of Education. Experiments show that applying imbalanced learning methods produces good recall but decreases precision, whereas base classifiers that ignore the imbalance give better precision but poor recall. Overall, applying imbalanced learning techniques is beneficial, yet more work is needed to improve precision. Louisiana has maintained one of the highest school dropout rates in the US for many years: the Public Affairs Research Council of Louisiana (PAR, October 2011) estimates that one in six public high school students in the state drops out of school.
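The Louisiana administrative data itself is not public, so the following is only a hedged sketch on synthetic imbalanced data: it uses scikit-learn's class_weight option on a decision tree as one simple form of case weighting / cost-sensitive learning, and reports the rare-class precision, recall, and F-measure that the study compares.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for an imbalanced dropout dataset: ~5% positive class.
X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive learning via case weighting: errors on the rare class are
# penalized more heavily through class_weight instead of resampling the data.
plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

# Compare precision, recall, and F1 of the rare class (label 1).
print(classification_report(y_te, plain.predict(X_te), digits=3))
print(classification_report(y_te, weighted.predict(X_te), digits=3))
```

The typical pattern mirrors the study's finding: the weighted model trades precision for substantially better recall on the rare class.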
A classifier is only as good as the metric used to evaluate it. If you choose the wrong metric to evaluate your models, you are likely to choose a poor model, or, in the worst case, be misled about the expected performance of your model. Choosing an appropriate metric is challenging in applied machine learning generally, but it is particularly difficult for imbalanced classification problems: first, because most of the widely used standard metrics assume a balanced class distribution, and second, because for imbalanced classification not all classes, and therefore not all prediction errors, are typically equal. In this tutorial, you will discover metrics that you can use for imbalanced classification.
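A toy example makes the point (a minimal sketch with fabricated labels): on a 99%-negative problem, a model that never predicts the positive class looks excellent under accuracy and worthless under F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 1%-positive problem: 990 negatives, 10 positives.
y_true = np.array([0] * 990 + [1] * 10)

# A useless "always predict the majority class" model...
y_majority = np.zeros_like(y_true)

# ...scores 99% accuracy but 0 F1 on the class we actually care about.
print("accuracy:", accuracy_score(y_true, y_majority))            # 0.99
print("F1:      ", f1_score(y_true, y_majority, zero_division=0)) # 0.0
```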
With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective way to address imbalanced classification, but one of the main challenges for existing oversampling methods is labeling the new synthetic samples accurately. Inaccurately labeled synthetic samples distort the distribution of the dataset and can worsen classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples produced by traditional oversampling methods: a graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose cost-sensitive neighborhood components analysis for high-dimensional datasets and a bootstrap-based ensemble framework for highly imbalanced datasets. The proposed method achieves good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high-imbalance and high-dimensionality problems, and its average performance and robustness are better than those of the benchmark methods.
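The paper's graph semi-supervised SMOTE is, to my knowledge, not packaged in a public library, so the sketch below shows only the plain SMOTE baseline it builds on (via the imbalanced-learn package, on synthetic data); the comment marks the labeling assumption the paper calls into question.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: ~5% minority class.
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Plain SMOTE interpolates between a minority sample and one of its k nearest
# minority neighbors; every synthetic point simply inherits the minority label,
# which is exactly the assumption the paper's graph semi-supervised variant
# tries to make more credible.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```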