Goto

Collaborating Authors

 Performance Analysis


The ratio of normalizing constants for Bayesian graphical Gaussian model selection

arXiv.org Machine Learning

The ratio of normalizing constants for the G-Wishart distribution, for two graphs differing by an edge e, has long been a bottleneck in the search for efficient model selection in the class of graphical Gaussian models. We give an accurate approximation to this ratio under two assumptions: first, we assume that the scale of the prior is the identity, second we assume that the set of paths between the two ends of e are disjoint. The first approximation does not represent a restriction since this is what statisticians use. The second assumption is a real restriction but we conjecture that similar results are also true without this second assumption. We shall prove it in subsequent work. This approximation is simply a ratio of Gamma functions and thus need no simulation. We illustrate the efficiency and practical impact of our result by comparing model selection in the class of graphical Gaussian models using this approximation and using current Metropolis-Hastings methods. We work both with simulated data and a complex high-dimensional real data set. In the numerical examples, we do not assume that the paths between the two endpoints of edge e are disjoint.


Predictive modelling of training loads and injury in Australian football

arXiv.org Machine Learning

To investigate whether training load monitoring data could be used to predict injuries in elite Australian football players, data were collected from elite athletes over 3 seasons at an Australian football club. Loads were quantified using GPS devices, accelerometers and player perceived exertion ratings. Absolute and relative training load metrics were calculated for each player each day (rolling average, exponentially weighted moving average, acute:chronic workload ratio, monotony and strain). Injury prediction models (regularised logistic regression, generalised estimating equations, random forests and support vector machines) were built for non-contact, non-contact time-loss and hamstring specific injuries using the first two seasons of data. Injury predictions were generated for the third season and evaluated using the area under the receiver operator characteristic (AUC). Predictive performance was only marginally better than chance for models of non-contact and non-contact time-loss injuries (AUC$<$0.65). The best performing model was a multivariate logistic regression for hamstring injuries (best AUC=0.76). Learning curves suggested logistic regression was underfitting the load-injury relationship and that using a more complex model or increasing the amount of model building data may lead to future improvements. Injury prediction models built using training load data from a single club showed poor ability to predict injuries when tested on previously unseen data, suggesting they are limited as a daily decision tool for practitioners. Focusing the modelling approach on specific injury types and increasing the amount of training data may lead to the development of improved predictive models for injury prevention.


Machine Learning for Everyone - Part 2: Spotting anomalous data

#artificialintelligence

We're going to analyze data that contain cases flagged as abnormal. So we'll build a predictive model in order to spot cases that are not currently flagged as abnormal, but behaving like ones that are. This post contains R code and some machine learning explanations, which can be extrapolated to other languages such as Python. The idea is to create a case study giving the reader the opportunity to recreate results. Note: There are some points oversimplified in the analysis, but hopefully you'll become curious to learn more about this topic, in case you've never done a project like this.


Learning to Detect Sepsis with a Multitask Gaussian Process RNN Classifier

arXiv.org Machine Learning

We present a scalable end-to-end classifier that uses streaming physiological and medication data to accurately predict the onset of sepsis, a life-threatening complication from infections that has high mortality and morbidity. Our proposed framework models the multivariate trajectories of continuous-valued physiological time series using multitask Gaussian processes, seamlessly accounting for the high uncertainty, frequent missingness, and irregular sampling rates typically associated with real clinical data. The Gaussian process is directly connected to a black-box classifier that predicts whether a patient will become septic, chosen in our case to be a recurrent neural network to account for the extreme variability in the length of patient encounters. We show how to scale the computations associated with the Gaussian process in a manner so that the entire system can be discriminatively trained end-to-end using backpropagation. In a large cohort of heterogeneous inpatient encounters at our university health system we find that it outperforms several baselines at predicting sepsis, and yields 19.4% and 55.5% improved areas under the Receiver Operating Characteristic and Precision Recall curves as compared to the NEWS score currently used by our hospital.


Boosting the accuracy of your Machine Learning models

#artificialintelligence

Boosting is here to help. Boosting is a popular machine learning algorithm that increases accuracy of your model, something like when racers use nitrous boost to increase the speed of their car. Boosting uses a base machine learning algorithm to fit the data. This can be any algorithm, but Decision Tree is most widely used. For an answer to why so, just keep reading.


Inspecting Algorithms for Bias

MIT Technology Review

It was a striking story. "Machine Bias," the headline read, and the teaser proclaimed: "There's software used across the country to predict future criminals. And it's biased against blacks." ProPublica, a Pulitzer Prizeโ€“winning nonprofit news organization, had analyzed risk assessment software known as COMPAS. It is being used to forecast which criminals are most likely to reoffend.


Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem

arXiv.org Machine Learning

One of the advantages that decision trees have over many other models is their ability to natively handle categorical predictors without having to first transform them (e.g., by using one-hot encoding). However, in this paper, we show how this capability can also lead to an inherent "absent levels" problem for decision tree based algorithms that, to the best of our knowledge, has never been thoroughly discussed, and whose consequences have never been carefully explored. This predicament occurs whenever there is indeterminacy in how to handle an observation that has reached a categorical split which was determined when the observation's level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler's random forests FORTRAN code and the randomForest R package as motivating case studies, we show how overlooking the absent levels problem can systematically bias a model. Afterwards, we discuss some heuristics that can possibly be used to help mitigate the absent levels problem and, using three real data examples taken from public repositories, we demonstrate the superior performance and reliability of these heuristics over some of the existing approaches that are currently being employed in practice due to oversights in the software implementations of decision tree based algorithms. Given how extensively these algorithms have been used, it is conceivable that a sizable number of these models have been unknowingly and seriously affected by this issue---further emphasizing the need for the development of both theory and software that accounts for the absent levels problem.


Engineering features to automatically assess image quality

#artificialintelligence

While at Insight, I had the opportunity to consult on a data science project for AptDeco.com. AptDeco is a NYC based peer-to-peer online marketplace for buying and selling used furniture. The AptDeco team fills in any missing details about the furniture, creates high quality listings on their website, and even delivers the furniture when it's purchased. The first step a user takes when creating a listing on AptDeco is to submit a picture of their furniture. The editors at AptDeco will manually review the pictures, and decide if the images are of high enough quality to be edited and displayed on the front page of the listing.


Conformal k-NN Anomaly Detector for Univariate Data Streams

arXiv.org Machine Learning

Anomalies in time-series data give essential and often actionable information in many applications. In this paper we consider a model-free anomaly detection method for univariate time-series which adapts to non-stationarity in the data stream and provides probabilistic abnormality scores based on the conformal prediction paradigm. Despite its simplicity the method performs on par with complex prediction-based models on the Numenta Anomaly Detection benchmark and the Yahoo!


Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging

arXiv.org Machine Learning

We address the statistical and optimization impacts of using classical sketch versus Hessian sketch to solve approximately the Matrix Ridge Regression (MRR) problem. Prior research has considered the effects of classical sketch on least squares regression (LSR), a strictly simpler problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does on those of LSR---namely, it recovers nearly optimal solutions. In contrast, Hessian sketch does not have this guarantee, instead, the approximation error is governed by a subtle interplay between the "mass" in the responses and the optimal objective value. For both types of approximations, the regularization in the sketched MRR problem gives it significantly different statistical properties from the sketched LSR problem. In particular, there is a bias-variance trade-off in sketched MRR that is not present in sketched LSR. We provide upper and lower bounds on the biases and variances of sketched MRR, these establish that the variance is significantly increased when classical sketches are used, while the bias is significantly increased when using Hessian sketches. Empirically, sketched MRR solutions can have risks that are higher by an order-of-magnitude than those of the optimal MRR solutions. We establish theoretically and empirically that model averaging greatly decreases this gap. Thus, in the distributed setting, sketching combined with model averaging is a powerful technique that quickly obtains near-optimal solutions to the MRR problem while greatly mitigating the statistical risks incurred by sketching.