Accuracy
Suicide Risk Modeling with Uncertain Diagnostic Records
Wang, Wenjie, Luo, Chongliang, Aseltine, Robert H., Wang, Fei, Yan, Jun, Chen, Kun
Motivated by the pressing need for suicide prevention through improving behavioral healthcare, we use medical claims data to study the risk of subsequent suicide attempts for patients who were hospitalized due to suicide attempts and later discharged. Understanding the risk behaviors of such patients at elevated suicide risk is an important step towards the goal of "Zero Suicide". An immediate and unconventional challenge is that the identification of suicide attempts from medical claims contains substantial uncertainty: almost 20\% of "suspected" suicide attempts are identified from diagnostic codes indicating external causes of injury and poisoning with undermined intent. It is thus of great interest to learn which of these undetermined events are more likely actual suicide attempts and how to properly utilize them in survival analysis with severe censoring. To tackle these interrelated problems, we develop an integrative Cox cure model with regularization to perform survival regression with uncertain events and a latent cure fraction. We apply the proposed approach to study the risk of subsequent suicide attempt after suicide-related hospitalization for adolescent and young adult population, using medical claims data from Connecticut. The identified risk factors are highly interpretable; more intriguingly, our method distinguishes the risk factors that are most helpful in assessing either susceptibility or timing of subsequent attempt. The predicted statuses of the uncertain attempts are further investigated, leading to several new insights on suicide event identification.
The Integrity of Machine Learning Algorithms against Software Defect Prediction
and, Param Khakhar, Dubey, Rahul Kumar, IEEE, Senior Member
The increased computerization in recent years has resulted in the production of a variety of different software, however measures need to be taken to ensure that the produced software isn't defective. Many researchers have worked in this area and have developed different Machine Learning-based approaches that predict whether the software is defective or not. This issue can't be resolved simply by using different conventional classifiers because the dataset is highly imbalanced i.e the number of defective samples detected is extremely less as compared to the number of non-defective samples. Therefore, to address this issue, certain sophisticated methods are required. The different methods developed by the researchers can be broadly classified into Resampling based methods, Cost-sensitive learning-based methods, and Ensemble Learning. Among these methods. This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et.al. against several classifiers such as Logistic Regression, Support Vector Machine, Random Forest, and Na\"ive Bayes after oversampling the data. OS-ELM trains faster than conventional deep neural networks and it always converges to the globally optimal solution. A comparison is performed on the original dataset as well as the over-sampled data set. The oversampling technique used is Cluster-based Over-Sampling with Noise Filtering. This technique is better than several state-of-the-art techniques for oversampling. The analysis is carried out on 3 projects KC1, PC4 and PC3 carried out by the NASA group. The metrics used for measurement are recall and balanced accuracy. The results are higher for OS-ELM as compared to other classifiers in both scenarios.
Communication-efficient distributed eigenspace estimation
Charisopoulos, Vasileios, Benson, Austin R., Damle, Anil
Distributed computing is a standard way to scale up machine learning and data science algorithms to process large amounts of data. In such settings, avoiding communication amongst machines is paramount for achieving high performance. Rather than distribute the computation of existing algorithms, a common practice for avoiding communication is to compute local solutions or parameter estimates on each machine and then combine the results; in many convex optimization problems, even simple averaging of local solutions can work well. However, these schemes do not work when the local solutions are not unique. Spectral methods are a collection of such problems, where solutions are orthonormal bases of the leading invariant subspace of an associated data matrix, which are only unique up to rotation and reflections. Here, we develop a communication-efficient distributed algorithm for computing the leading invariant subspace of a data matrix. Our algorithm uses a novel alignment scheme that minimizes the Procrustean distance between local solutions and a reference solution, and only requires a single round of communication. For the important case of principal component analysis (PCA), we show that our algorithm achieves a similar error rate to that of a centralized estimator. We present numerical experiments demonstrating the efficacy of our proposed algorithm for distributed PCA, as well as other problems where solutions exhibit rotational symmetry, such as node embeddings for graph data and spectral initialization for quadratic sensing.
The Area Under the ROC Curve as a Measure of Clustering Quality
Jaskowiak, Pablo Andretta, Costa, Ivan Gesteira, Campello, Ricardo José Gabrielli Barreto
The Area Under the the Receiver Operating Characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we demonstrate that, in the context of internal/relative clustering validation, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a computationally much more efficient and practical algorithmic procedure. Our theoretical findings are supported by experimental results.
Proposing a two-step Decision Support System (TPIS) based on Stacked ensemble classifier for early and low cost (step-1) and final (step-2) differential diagnosis of Mycobacterium Tuberculosis from non-tuberculosis Pneumonia
Khatibi, Toktam, Farahani, Ali, Sarmadian, Hossein
Background: Mycobacterium Tuberculosis (TB) is an infectious bacterial disease presenting similar symptoms to pneumonia; therefore, differentiating between TB and pneumonia is challenging. Therefore, the main aim of this study is proposing an automatic method for differential diagnosis of TB from Pneumonia. Methods: In this study, a two-step decision support system named TPIS is proposed for differential diagnosis of TB from pneumonia based on stacked ensemble classifiers. The first step of our proposed model aims at early diagnosis based on low-cost features including demographic characteristics and patient symptoms (including 18 features). TPIS second step makes the final decision based on the meta features extracted in the first step, the laboratory tests and chest radiography reports. This retrospective study considers 199 patient medical records for patients suffering from TB or pneumonia, which has been registered in a hospital in Arak, Iran. Results: Experimental results show that TPIS outperforms the compared machine learning methods for early differential diagnosis of pulmonary tuberculosis from pneumonia with AUC of 90.26 and accuracy of 91.37 and final decision making with AUC of 92.81 and accuracy of 93.89. Conclusions: The main advantage of early diagnosis is beginning the treatment procedure for confidently diagnosed patients as soon as possible and preventing latency in treatment. Therefore, early diagnosis reduces the maturation of late treatment of both diseases.
Simulation-Assisted Decorrelation for Resonant Anomaly Detection
Benkendorfer, Kees, Pottier, Luc Le, Nachman, Benjamin
A growing number of weak- and unsupervised machine learning approaches to anomaly detection are being proposed to significantly extend the search program at the Large Hadron Collider and elsewhere. One of the prototypical examples for these methods is the search for resonant new physics, where a bump hunt can be performed in an invariant mass spectrum. A significant challenge to methods that rely entirely on data is that they are susceptible to sculpting artificial bumps from the dependence of the machine learning classifier on the invariant mass. We explore two solutions to this challenge by minimally incorporating simulation into the learning. In particular, we study the robustness of Simulation Assisted Likelihood-free Anomaly Detection (SALAD) to correlations between the classifier and the invariant mass. Next, we propose a new approach that only uses the simulation for decorrelation but the Classification without Labels (CWoLa) approach for achieving signal sensitivity. Both methods are compared using a full background fit analysis on simulated data from the LHC Olympics and are robust to correlations in the data.
Finite-sample analysis of interpolating linear classifiers in the overparameterized regime
Chatterji, Niladri S., Long, Philip M.
A surprising statistical phenomenon has emerged in modern machine learning: highly complex models can interpolate training data while still generalizing well to test data, even in the presence of label noise. This is rather striking as it the goes against the grain of the classical statistical wisdom which dictates that predictors that generalize well should trade off between the fit to the training data and the some measure of the complexity or smoothness of the predictor. Many estimators like neural networks, kernel estimators, nearest neighbour estimators, and even linear models have been shown to demonstrate this phenomenon (see, Zhang et al. 2017; Belkin et al. 2019, among others). This phenomenon has recently inspired intense theoretical research. One line of work (Soudry et al. 2018; Ji and Telgarsky 2019; Gunasekar et al. 2017; Nacson, Srebro, and Soudry 2019; Gunasekar et al. 2018a; Gunasekar et al. 2018b) formalized the argument (Neyshabur, Tomioka, and Srebro 2014; Neyshabur 2017) that, even when there is no explicit regularization that is used in training these rich models, there is nevertheless implicit regularization encoded in the choice of the optimization method used. For example, in the setting of linear classification, (Soudry et al. 2018; Ji and Telgarsky 2019; Nacson, Srebro, and Soudry 2019) show that learning a linear classifier using gradient descent on the unregularized logistic or exponential loss asymptotically leads the solution to converge to the maximum l
Performance measures of models
Schools and colleges regularly conduct tests. The basic idea behind this is to measure the performance of the students. To understand which is their strong subject and where they need to work harder. In the field of machine learning, other than building models, it's equally important to measure the performance of the model. Basically, we check how good are the predictions made by our model.
Large Dimensional Analysis and Improvement of Multi Task Learning
Tiomoko, Malik, Couillet, Romain, Tiomoko, Hafiz
Multi Task Learning (MTL) efficiently leverages useful information c ontained in multiple related tasks to help improve the generalization performance of all tasks. This article conducts a large dimensional analysis of a simple but, as we shall see, extremely powerful when carefully tuned, Least Square Support Vector Machine (LSS VM) version of MTL, in the regime where the dimension p of the data and their number n grow large at the same rate. Under mild assumptions on the input data, the theoretical analysis o f the MTL-LSSVM algorithm first reveals the "sufficient statistics" exploited by the alg orithm and their interaction at work. These results demonstrate, as a striking consequ ence, that the standard approach to MTL-LSSVM is largely suboptimal, can lead to severe effe cts of negative transfer but that these impairments are easily corrected. These correctio ns are turned into an improved MTL-LSSVM algorithm which can only benefit from additional data, and the theoretical performance of which is also analyzed. As evidenced and theoretically sustained in numerous recent works, these large dimensional results are robust to broad ranges of data distributions, w hich our present experiments corroborate. Specifically, the article reports a systematic ally close behavior between theoretical and empirical performances on popular datasets, wh ich is strongly suggestive of the applicability of the proposed carefully tuned MTL-LSSVM method to real data. This fine-tuning is fully based on the theoretical analysis and does not in p articular require any cross validation procedure. Besides, the reported performance s on real datasets almost systematically outperform much more elaborate and less intuitive state -of-the-art multi-task and transfer learning methods.
MixBoost: Synthetic Oversampling with Boosted Mixup for Handling Extreme Imbalance
Kabra, Anubha, Chopra, Ayush, Puri, Nikaash, Badjatiya, Pinkesh, Verma, Sukriti, Gupta, Piyush, K, Balaji
Training a classification model on a dataset where the instances of one class outnumber those of the other class is a challenging problem. Such imbalanced datasets are standard in real-world situations such as fraud detection, medical diagnosis, and computational advertising. We propose an iterative data augmentation method, MixBoost, which intelligently selects (Boost) and then combines (Mix) instances from the majority and minority classes to generate synthetic hybrid instances that have characteristics of both classes. We evaluate MixBoost on 20 benchmark datasets, show that it outperforms existing approaches, and test its efficacy through significance testing. We also present ablation studies to analyze the impact of the different components of MixBoost.