Performance Analysis
New Methods and Datasets for Group Anomaly Detection From Fundamental Physics
Kasieczka, Gregor, Nachman, Benjamin, Shih, David
The identification of anomalous overdensities in data - group or collective anomaly detection - is a rich problem with a large number of real world applications. However, it has received relatively little attention in the broader ML community, as compared to point anomalies or other types of single instance outliers. One reason for this is the lack of powerful benchmark datasets. In this paper, we first explain how, after the Nobel-prize winning discovery of the Higgs boson, unsupervised group anomaly detection has become a new frontier of fundamental physics (where the motivation is to find new particles and forces). Then we propose a realistic synthetic benchmark dataset (LHCO2020) for the development of group anomaly detection algorithms. Finally, we compare several existing statistically-sound techniques for unsupervised group anomaly detection, and demonstrate their performance on the LHCO2020 dataset.
Feature Cross Search via Submodular Optimization
Chen, Lin, Esfandiari, Hossein, Fu, Gang, Mirrokni, Vahab S., Yu, Qian
In this paper, we study feature cross search as a fundamental primitive in feature engineering. The importance of feature cross search especially for the linear model has been known for a while, with well-known textbook examples. In this problem, the goal is to select a small subset of features, combine them to form a new feature (called the crossed feature) by considering their Cartesian product, and find feature crosses to learn an \emph{accurate} model. In particular, we study the problem of maximizing a normalized Area Under the Curve (AUC) of the linear model trained on the crossed feature column. First, we show that it is not possible to provide an $n^{1/\log\log n}$-approximation algorithm for this problem unless the exponential time hypothesis fails. This result also rules out the possibility of solving this problem in polynomial time unless $\mathsf{P}=\mathsf{NP}$. On the positive side, by assuming the \naive\ assumption, we show that there exists a simple greedy $(1-1/e)$-approximation algorithm for this problem. This result is established by relating the AUC to the total variation of the commutator of two probability measures and showing that the total variation of the commutator is monotone and submodular. To show this, we relate the submodularity of this function to the positive semi-definiteness of a corresponding kernel matrix. Then, we use Bochner's theorem to prove the positive semi-definiteness by showing that its inverse Fourier transform is non-negative everywhere. Our techniques and structural results might be of independent interest.
Mining and Predicting No-Show Medical Appointments: Using Hybrid Sampling Technique
However, no-shows are frequent in both general medical practices and specialties, and they can be quite costly and disruptive. This problem has become more severe because of COVID-19. The primary purpose of this study is to develop machine learning algorithms to predict if patients will keep their next appointment, which would help with rescheduling appointments. The main objective in addressing the no-show problem is to reduce the false negative rate (i.e., Type II error). That occurs when the model incorrectly predicts that the patients will show up for an appointment, but they do not. Moreover, the dataset encounters an imbalance issue, and this paper addresses that issue with a new and effective hybrid sampling method: ALL K-NN and adaptive synthetic (ADASYN) yield a 0% false negative rate through machine learning models.
Statistical Theory for Imbalanced Binary Classification
Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
WisdomNet: Prognosis of COVID-19 with Slender Prospect of False Negative Cases and Vaticinating the Probability of Maturation to ARDS using Posteroanterior Chest X-Rays
Kumar, Peeyush, Gangal, Ayushe, Kumari, Sunita
Coronavirus is a large virus family consisting of diverse viruses, some of which disseminate among mammals and others cause sickness among humans. COVID-19 is highly contagious and is rapidly spreading, rendering its early diagnosis of preeminent status. Researchers, medical specialists and organizations all over the globe have been working tirelessly to combat this virus and help in its containment. In this paper, a novel neural network called WisdomNet has been proposed, for the diagnosis of COVID-19 using chest X-rays. The WisdomNet uses the concept of Wisdom of Crowds as its founding idea. It is a two-layered convolutional Neural Network (CNN), which takes chest x-ray images as input. Both layers of the proposed neural network consist of a number of neural networks each. The dataset used for this study consists of chest x-ray images of COVID-19 positive patients, compiled and shared by Dr. Cohen on GitHub, and the chest x-ray images of healthy lungs and lungs affected by viral and bacterial pneumonia were obtained from Kaggle. The network not only pinpoints the presence of COVID-19, but also gives the probability of the disease maturing into Acute Respiratory Distress Syndrome (ARDS). Thus, predicting the progression of the disease in the COVID-19 positive patients. The network also slender the occurrences of false negative cases by employing a high threshold value, thus aids in curbing the spread of the disease and gives an accuracy of 100% for successfully predicting COVID-19 among the chest x-rays of patients affected with COVID-19, bacterial and viral pneumonia.
Empirically Measuring Transfer Distance for System Design and Operation
Cody, Tyler, Adams, Stephen, Beling, Peter A.
Classical machine learning approaches are sensitive to non-stationarity. Transfer learning can address non-stationarity by sharing knowledge from one system to another, however, in areas like machine prognostics and defense, data is fundamentally limited. Therefore, transfer learning algorithms have little, if any, examples from which to learn. Herein, we suggest that these constraints on algorithmic learning can be addressed by systems engineering. We formally define transfer distance in general terms and demonstrate its use in empirically quantifying the transferability of models. We consider the use of transfer distance in the design of machine rebuild procedures to allow for transferable prognostic models. We also consider the use of transfer distance in predicting operational performance in computer vision. Practitioners can use the presented methodology to design and operate systems with consideration for the learning theoretic challenges faced by component learning systems.
Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection
Hillman, Jonathan, Hocking, Toby Dylan
Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines.
Fair Decision Rules for Binary Classification
Lawless, Connor, Gunluk, Oktay
In recent years, machine learning has begun automating decision making in fields as varied as college admissions, credit lending, and criminal sentencing. The socially sensitive nature of some of these applications together with increasing regulatory constraints has necessitated the need for algorithms that are both fair and interpretable. In this paper we consider the problem of building Boolean rule sets in disjunctive normal form (DNF), an interpretable model for binary classification, subject to fairness constraints. We formulate the problem as an integer program that maximizes classification accuracy with explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation framework, with a novel formulation, is used to efficiently search over exponentially many possible rules. When combined with faster heuristics, our method can deal with large data-sets. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.
Not All Mistakes Are Created Equal: Cost-sensitive Learning
In classification problems, we often assume that every misclassification is equally bad. Consider the example of trying to classify whether or not there is a terrorist threat. There are two types of misclassifications: either we predict there is a threat but there is actually no threat (false positive), or we predict there is no threat but there actually is a threat (false negative). Clearly the false negative is much more dangerous than the false positive -- we might end up wasting time and money in the false positive case, but people might die in the false negative case. We call classification problems like this cost-sensitive.
A Tool Of Choice for Bootstrapping High Quality Python Packages
PyScaffold is the tool of choice for bootstrapping high quality Python packages, ready to be shared on PyPI and installable via pip. It is easy to use and encourages the adoption of the best tools and practices of the Python ecosystem, helping you and your team to stay sane, happy and productive. It is stable and has been used by thousands of developers for over half a decade! PyScaffold is the tool of choice for bootstrapping high quality Python packages, ready to be shared on PyPI and installable via pip. It is easy to use and encourages the adoption of the best tools and practices of the Python ecosystem, helping you and your team to stay sane, happy and productive.