Accuracy
Classical Statistics and Statistical Learning in Imaging Neuroscience
Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. In recent years, statistical learning methods, including cross-validation, pattern classification, and sparsity-inducing regression, have enjoyed increasing popularity. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning and their relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. The paper thus attempts to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Efficient Distributed Estimation of Inverse Covariance Matrices
Arroyo, Jesús, Hou, Elizabeth
In distributed systems, communication is a major concern due to issues such as its vulnerability or efficiency. In this paper, we are interested in estimating sparse inverse covariance matrices when samples are distributed across different machines. We address communication efficiency by proposing a method where, in a single round of communication, each machine transfers a small subset of the entries of the inverse covariance matrix. We show that, with this efficient distributed method, the error rates can be comparable with estimation in a non-distributed setting, and correct model selection is still possible. Practical performance is shown through simulations.
An evaluation of randomized machine learning methods for redundant data: Predicting short and medium-term suicide risk from administrative records and risk assessments
Nguyen, Thuong, Tran, Truyen, Gopakumar, Shivapratap, Phung, Dinh, Venkatesh, Svetha
Accurate prediction of suicide risk in mental health patients remains an open problem. Existing methods, including clinician judgments, have acceptable sensitivity but yield many false positives. Exploiting administrative data has great potential, but the data has high dimensionality and redundancies arising from the recording processes. We investigate the efficacy of three of the most effective randomized machine learning techniques - random forests, gradient boosting machines, and deep neural nets with dropout - in predicting suicide risk. Using a cohort of mental health patients from a regional Australian hospital, we compare their predictive performance with popular traditional approaches - clinician judgments based on a checklist, sparse logistic regression and decision trees. The randomized methods demonstrated robustness against data redundancies and superior predictive performance on AUC and F-measure. Keywords: Suicide risk, Electronic medical record, Predictive models, Randomized machine learning, Deep learning 1. Introduction Every year, about 2000 Australians die by suicide, causing huge trauma to families, friends, workplaces and communities[1].
Personalized Risk Scoring for Critical Care Patients using Mixtures of Gaussian Process Experts
Alaa, Ahmed M., Yoon, Jinsung, Hu, Scott, van der Schaar, Mihaela
We develop a personalized real-time risk scoring algorithm that provides timely and granular assessments of the clinical acuity of ward patients based on their (temporal) lab tests and vital signs. Heterogeneity of the patient population is captured via a hierarchical latent class model. The proposed algorithm aims to discover the number of latent classes in the patient population and to train a mixture of Gaussian Process (GP) experts, where each expert models the physiological data streams associated with a specific class. Self-taught transfer learning is used to transfer the knowledge of latent classes learned from the domain of clinically stable patients to the domain of clinically deteriorating patients. For new patients, the posterior beliefs of all GP experts about the patient's clinical status given her physiological data stream are computed, and a personalized risk score is evaluated as a weighted average of those beliefs, where the weights are learned from the patient's hospital admission information. Experiments on a heterogeneous cohort of 6,313 patients admitted to Ronald Reagan UCLA Medical Center show that our risk score outperforms the currently deployed risk scores, such as MEWS and Rothman scores.
Contrastive Structured Anomaly Detection for Gaussian Graphical Models
Gaussian graphical models (GGMs) are probabilistic tools of choice for analyzing conditional dependencies between variables in complex systems. Finding changepoints in the structural evolution of a GGM is therefore essential to detecting anomalies in the underlying system modeled by the GGM. In order to detect structural anomalies in a GGM, we consider the problem of estimating changes in the precision matrix of the corresponding Gaussian distribution. We take a two-step approach to solving this problem: (i) estimating a background precision matrix using system observations from the past without any anomalies, and (ii) estimating a foreground precision matrix using a sliding temporal window during anomaly monitoring. Our primary contribution is in estimating the foreground precision using a novel contrastive inverse covariance estimation procedure. In order to accurately learn only the structural changes to the GGM, we maximize a penalized log-likelihood where the penalty is the $\ell_1$ norm of the difference between the foreground precision being estimated and the already learned background precision. We modify the alternating direction method of multipliers (ADMM) algorithm for sparse inverse covariance estimation to perform contrastive estimation of the foreground precision matrix. Our results on simulated GGM data show significant improvement in precision and recall for detecting structural changes to the GGM, compared to a non-contrastive sliding window baseline.
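The penalized log-likelihood described in this abstract can be written out as a sketch. The symbols are assumptions introduced here for illustration and do not come from the abstract: $S_f$ is the sample covariance of the current sliding window, $\hat{\Theta}_b$ is the background precision matrix learned in step (i), $\lambda > 0$ is a penalty weight, and $\|\cdot\|_1$ is taken to be the entrywise $\ell_1$ norm. The first two terms are the standard Gaussian log-likelihood up to constants; the paper's exact formulation may differ.

```latex
\hat{\Theta}_f
  \;=\; \arg\max_{\Theta \succ 0}\;
  \log\det\Theta \;-\; \operatorname{tr}\!\left(S_f\,\Theta\right)
  \;-\; \lambda\,\bigl\lVert \Theta - \hat{\Theta}_b \bigr\rVert_1
```

Under this reading, entries where $\hat{\Theta}_f$ differs from $\hat{\Theta}_b$ mark candidate structural changes, and a larger $\lambda$ pulls the foreground estimate closer to the background.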
Provable Sparse Tensor Decomposition
Sun, Will Wei, Lu, Junwei, Liu, Han, Cheng, Guang
We propose a novel sparse tensor decomposition method, namely Tensor Truncated Power (TTP) method, that incorporates variable selection into the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixture and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the obtained statistical rate significantly improves those shown in the existing non-sparse decomposition methods. The empirical advantages of TTP are confirmed in extensive simulated results and two real applications of click-through rate prediction and high-dimensional gene clustering.
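To make the idea of a truncation step inside a tensor power iteration concrete, here is a minimal pure-Python sketch. It is a simplified illustration for a symmetric third-order tensor with a single component, not the authors' TTP algorithm: the function names, the flat starting vector, the fixed iteration count and the toy tensor are all assumptions made here, and the initialization procedure mentioned in the abstract is not reproduced.

```python
import math


def normalize(u):
    """Scale u to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]


def truncate(u, s):
    """Keep the s largest-magnitude entries of u and zero the rest."""
    keep = set(sorted(range(len(u)), key=lambda i: -abs(u[i]))[:s])
    return [x if i in keep else 0.0 for i, x in enumerate(u)]


def contract(T, u):
    """Compute T(I, u, u): entry i is sum over j, k of T[i][j][k] * u[j] * u[k]."""
    d = len(u)
    return [sum(T[i][j][k] * u[j] * u[k] for j in range(d) for k in range(d))
            for i in range(d)]


def truncated_power(T, s, iters=25):
    """Power iteration with a hard-truncation step after every update."""
    d = len(T)
    u = normalize([1.0] * d)  # assumption: flat start, not a tailored initializer
    for _ in range(iters):
        u = normalize(truncate(contract(T, u), s))
    return u


# Hypothetical toy data: a rank-one tensor built from a sparse unit vector v,
# plus a small perturbation on the diagonal entries.
v = [0.6, 0.8, 0.0, 0.0]
noise = 0.01
T = [[[v[i] * v[j] * v[k] + (noise if i == j == k else 0.0)
       for k in range(4)] for j in range(4)] for i in range(4)]

u_hat = truncated_power(T, s=2)
print(u_hat)
```

Without the truncation step, the perturbation leaks small nonzero values into the last two coordinates; with `s=2` they are set exactly to zero, which is the variable-selection effect the abstract refers to.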
Generate Simple And Easy ROC Curve With AUC
The ROC curve is a great way to determine how well your classifier is doing, but it can sometimes be tricky to actually generate the curve because of cryptic software instructions in languages such as R. In this tutorial we'll be using a few lines of Python code to generate an interactive graph like the one shown above. To make this super simple, we'll be using a Python install that has all of the statistical, mathematical and graphical packages already included. This will go much more smoothly if you start off by uninstalling Python from your system if you already have it installed. Once you've done that, head on over to https://www.continuum.io/downloads and download the Anaconda Python package, which contains absolutely everything we need.
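The tutorial's own code is not part of this excerpt, so what follows is only a hedged sketch of what an ROC curve and its AUC are computationally. It uses nothing beyond the Python standard library rather than the Anaconda packages the tutorial installs, it does no plotting, and the function names and the four-example data set are made up for illustration.

```python
def roc_curve_points(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores.

    labels are 0/1, scores are higher for examples believed to be positive.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Sort by descending score: lowering the threshold admits one score group at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Hypothetical classifier output for four examples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(labels, scores)
print(list(zip(fpr, tpr)))
print(auc(fpr, tpr))
```

The `fpr` and `tpr` lists are the x and y coordinates of the curve, so they can be handed to whichever plotting package is installed.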
Classification of Phishing Email Using Random Forest Machine Learning Technique
Phishing is one of the major challenges faced by the world of e-commerce today. Because of phishing attacks, billions of dollars have been lost by many companies and individuals. In 2012, an online report put the loss due to phishing attacks at about 1.5 billion. The global impact of phishing attacks will continue to increase, and thus more efficient phishing detection techniques are required to curb the menace. This paper investigates and reports the use of the random forest machine learning algorithm in the classification of phishing attacks, with the major objective of developing an improved phishing email classifier with better prediction accuracy and fewer features. From a dataset consisting of 2000 phishing and ham emails, a set of prominent phishing email features (identified from the literature) was extracted and used by the machine learning algorithm, with a resulting classification accuracy of 99.7% and low false negative (FN) and false positive (FP) rates.
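For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, false positive rate and false negative rate are derived from predicted and true labels. The label strings, function names and five-email example are hypothetical and are not taken from the paper's dataset; "phishing" is treated as the positive class by assumption.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # ham flagged as phishing
        else:
            if truth == positive:
                fn += 1  # phishing that slipped through
            else:
                tn += 1
    return tp, fp, tn, fn


def summary_rates(tp, fp, tn, fn):
    """Accuracy, false positive rate and false negative rate."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fp_rate": fp / (fp + tn),  # share of ham wrongly flagged
        "fn_rate": fn / (fn + tp),  # share of phishing missed
    }


# Hypothetical labels for five emails.
y_true = ["phishing", "phishing", "ham", "ham", "ham"]
y_pred = ["phishing", "ham", "ham", "phishing", "ham"]
rates = summary_rates(*confusion_counts(y_true, y_pred))
print(rates)
```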
Using Word2Vec document vectors as features in Naive Bayes • /r/MachineLearning
You could learn a discretization, or codebook, of your word2vec features. For example, you could run k-means on all of them (well, all your training word2vec features), then treat each one as a single instance of one of k words. Naive Bayes proceeds naturally from documents as histograms of these words, and you don't even have to normalize the word counts. But yeah, it's adding another step, and another parameter (k), and discretization can throw away specificity.
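A rough sketch of that codebook idea, using only the standard library: k-means turns each vector into one of k "words", every document becomes a histogram over those words, and a multinomial Naive Bayes is fit on the raw counts. Everything specific here is an assumption for illustration: the 2-D toy vectors stand in for real word2vec features, the centroids are seeded with the first k vectors to keep the run deterministic, and add-one smoothing (`alpha=1.0`) is a choice made here, not something from the comment.

```python
import math
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the closest centroid, i.e. the codeword assigned to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def kmeans(vectors, k, iters=20):
    centroids = [list(v) for v in vectors[:k]]  # assumption: seed with first k vectors
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def histogram(doc, centroids):
    """Represent a document (a list of vectors) as counts over the k codewords."""
    counts = Counter(nearest(v, centroids) for v in doc)
    return [counts.get(c, 0) for c in range(len(centroids))]


def train_nb(histograms, labels, k, alpha=1.0):
    """Multinomial Naive Bayes on raw codeword counts with additive smoothing."""
    model = {}
    for cls in sorted(set(labels)):
        rows = [h for h, y in zip(histograms, labels) if y == cls]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * k
        model[cls] = (math.log(len(rows) / len(labels)),
                      [math.log((t + alpha) / denom) for t in totals])
    return model


def predict_nb(model, hist):
    def score(cls):
        log_prior, log_probs = model[cls]
        return log_prior + sum(n * lp for n, lp in zip(hist, log_probs))
    return max(model, key=score)


# Hypothetical training documents: lists of 2-D vectors in place of word2vec features.
docs = [
    [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]],
    [[5.0, 5.1], [4.9, 5.0]],
    [[0.1, 0.0], [0.0, 0.2]],
    [[5.1, 4.9], [5.0, 5.0], [4.8, 5.2]],
]
labels = ["a", "b", "a", "b"]

k = 2
centroids = kmeans([v for d in docs for v in d], k)
model = train_nb([histogram(d, centroids) for d in docs], labels, k)
print(predict_nb(model, histogram([[0.05, 0.05], [0.1, 0.2]], centroids)))
```

The extra parameter the comment warns about is visible as `k`: too small and distinct vectors collapse into the same codeword, too large and the histograms become sparse.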
Practical Data Science in Python: Guidebook
Caveat: It uses the worst possible technique for spam filtering: Naive Bayes, which is responsible for extremely poor spam filtering systems with tons of false positives and false negatives, and is still alive today. So this is definitely not a good resource to learn data science, but it is a great tutorial to learn Python, especially since Naive Bayes is extremely easy to implement, though alternative but far better techniques, such as hidden decision trees, are almost as easy to code.