 Regression


Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery

arXiv.org Machine Learning

Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularization term $\lambda \|\bm{w}\|_1 + \eta\bm{w}^T\bm{M}\bm{w}$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different $\bm{M}$. This framework can be solved in a unified manner by a cyclic coordinate descent algorithm, which avoids matrix inversion and speeds up computation. However, if the estimated $\bm{w}_i$ and $\bm{w}_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address this, we introduce a novel network-regularized sparse LR model with a new penalty $\lambda \|\bm{w}\|_1 + \eta|\bm{w}|^T\bm{M}|\bm{w}|$ that considers the difference between the absolute values of the coefficients. We develop two efficient algorithms to solve it. Finally, we test our methods and compare them with related ones using simulated and real data to show their efficiency.
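To make the two penalties in this abstract concrete, here is a minimal sketch that evaluates both on a toy example. It is not the authors' code. The function names are hypothetical, and taking $\bm{M}$ to be the Laplacian of a two-node graph is an assumption made for illustration (the abstract only says that different $\bm{M}$ give different penalties). With coefficients of equal magnitude and opposite sign, the traditional quadratic term is large while the absolute-value variant vanishes.

```python
def l1_norm(w):
    """Sum of absolute values: the lambda * ||w||_1 part without lambda."""
    return sum(abs(wi) for wi in w)


def quad_form(v, M):
    """Compute v^T M v for a list v and a square list-of-lists M."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def network_penalty(w, M, lam, eta):
    """Traditional penalty: lam * ||w||_1 + eta * w^T M w."""
    return lam * l1_norm(w) + eta * quad_form(w, M)


def abs_network_penalty(w, M, lam, eta):
    """Absolute-value variant: lam * ||w||_1 + eta * |w|^T M |w|."""
    abs_w = [abs(wi) for wi in w]
    return lam * l1_norm(w) + eta * quad_form(abs_w, M)


# Assumed toy setup: Laplacian of a graph with two nodes joined by one edge.
laplacian = [[1.0, -1.0],
             [-1.0, 1.0]]
w_opposite = [1.0, -1.0]  # same magnitude, opposite signs

traditional = network_penalty(w_opposite, laplacian, lam=1.0, eta=1.0)
absolute = abs_network_penalty(w_opposite, laplacian, lam=1.0, eta=1.0)
```

In this sketch, setting `eta=0` leaves only the Lasso term, and passing an identity matrix as `M` gives an elastic-net style penalty, which matches the reductions the abstract describes.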


Predictive modelling of football injuries

arXiv.org Machine Learning

The goal of this thesis is to investigate the potential of predictive modelling for football injuries. This work was conducted in close collaboration with Tottenham Hotspur FC (THFC) and the PGA European tour, and with the participation of Wolverhampton Wanderers (WW). Three investigations were conducted:

1. Predicting the recovery time of football injuries using the UEFA injury recordings: The UEFA recordings are a common standard for recording injuries in professional football. For this investigation, three datasets of UEFA injury recordings were available. Different machine learning algorithms were used in order to build a predictive model. The performance of the machine learning models was then improved by feature selection conducted through correlation-based subset feature selection and random forests.
2. Predicting injuries in professional football using exposure records: The relationship between exposure (in training hours and match hours) in professional football athletes and injury incidence was studied. A common problem in football is understanding how the training schedule of an athlete can affect the chance of him getting injured. The task was to predict the number of days a player can train before he gets injured.
3. Predicting intrinsic injury incidence using in-training GPS measurements: A significant percentage of football injuries can be attributed to overtraining and fatigue. GPS data collected during training sessions might provide indicators of fatigue, or might be used to detect very intense training sessions which can lead to overtraining. This research used GPS data gathered during training sessions of the first team of THFC, in order to predict whether an injury would take place during a week.


BigML Summer 2016 Release and Webinar: Logistic Regression and more!

#artificialintelligence

BigML's Summer 2016 Release is here! Join us for a FREE live webinar to learn about the newest version of BigML. We'll be diving into Logistic Regression, one of the most popular supervised Machine Learning methods for solving classification problems. Last Fall we launched Logistic Regressions in the BigML API to let you easily create and download models to your environment for fast, local predictions. With this Summer Release, we go a step further by bringing Logistic Regression to the BigML Dashboard.


2016 Data Science Salary Survey results

#artificialintelligence

O'Reilly has released the results of the 2016 Data Science Salary Survey. This survey is based on data from over 900 respondents to a 64-question survey about data-related tasks, tools, and the salary they receive from doing/using them. The median salary reported in the survey was USD 87,000; amongst data scientists in the US, the median salary was USD 106,000. Appropriately for a survey about data science, O'Reilly doesn't merely report aggregate statistics from the survey; they fit a linear regression model to the data and extract coefficients from the model indicative of salary "bumps" (or downgrades) attributable to demographic factors. Factors that tended to increase salary included: working in cloud computing environments; working with Python; and being older.
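As a rough illustration of how a regression coefficient can be read as a salary "bump", the sketch below fits a one-variable least-squares line to a 0/1 indicator. The numbers are made up for the example and are not taken from the survey. The function name `fit_simple_ols` and the indicator `uses_python` are hypothetical. With a single binary predictor, the fitted slope equals the difference between the two group means.

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data (in thousands), NOT survey results.
uses_python = [0, 0, 1, 1]          # hypothetical 0/1 indicator
salary = [80.0, 90.0, 100.0, 110.0]

baseline, bump = fit_simple_ols(uses_python, salary)
# baseline is the mean salary of the 0 group; bump is the extra amount
# associated with the indicator being 1.
```

A real analysis like the one described would include many predictors at once, so each coefficient is a bump with the other factors held fixed. This sketch leaves that out.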


A Technical Primer On Causality

#artificialintelligence

What does "causality" mean, and how can you represent it mathematically? How can you encode causal assumptions, and what bearing do they have on data analysis? These types of questions are at the core of the practice of data science, but deep knowledge about them is surprisingly uncommon. If you analyze data without regard to causality, you open your results up to the possibility of enormous biases. This includes everything from recommendation system results, to post-hoc reports on observational data, to experiments run without proper holdout groups. I've been blogging a lot recently about causality, and wanted to go through some of the material at a more technical level. Recent posts have been aimed at a more general audience. This one will be aimed at practitioners, and will assume a basic working knowledge of math and data analysis. To get the most from this post you should have a reasonable understanding of linear regression and probability (although we'll review a lot of probability). Prior knowledge of graphical models will make some concepts more familiar, but is not required. Judea Pearl, in his book Causality, constantly remarks that until very recently, causality was a concept in search of a language.


Calibrating random forests for probability estimation - Dankowski - 2016 - Statistics in Medicine - Wiley Online Library

#artificialintelligence

Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models.


Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression

arXiv.org Machine Learning

In this paper, we study the performance of extremum estimators from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. By adapting the classical concentration inequalities, we derive upper bounds on the empirical out-of-sample prediction errors as a function of the in-sample errors, in-sample data size, heaviness in the tails of the error distribution, and model complexity. We show that the error bounds may be used for tuning key estimation hyper-parameters, such as the number of folds K in cross-validation. We also show how K affects the bias-variance tradeoff for cross-validation. Simulations are used to demonstrate key results.

We would also like to acknowledge participants at the 12th International Symposium on Econometric Theory and Applications and the 26th New Zealand Econometric Study Group as well as seminar participants at Utah, UNSW, and University of Melbourne for useful questions and comments. Fisher would like to acknowledge the financial support of the Australian Research Council, grant DP0663477.

1 Introduction

Traditionally in econometrics, an estimation method is implemented on sample data in order to infer patterns in a population. Put another way, inference centers on generalizing to the population the pattern learned from the sample and evaluating how well the sample pattern fits the population. An alternative perspective is to consider how well a sample pattern fits another sample. In this paper, we study the ability of a model estimated from a given sample to fit new samples from the same population, referred to as the generalization ability (GA) of the model. As a way of evaluating the external validity of sample estimates, the concept of GA has been implemented in recent empirical research. For example, in the policy evaluation literature [Belloni et al., 2013, Gechter, 2015, Dolton, 2006, Blundell et al., 2004], the central question is whether any treatment effect estimated from a pilot program can be generalized to out-of-sample individuals.
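The idea of measuring how well a fitted model predicts held-out samples, with the number of folds K as a tuning choice, can be sketched with a plain K-fold cross-validation loop. This illustrates cross-validation in general, not the bounds derived in the paper. The function names are hypothetical, the model is assumed to be a one-variable least-squares line, and the folds are contiguous blocks without shuffling to keep the example deterministic.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def cv_mean_squared_error(xs, ys, k):
    """Average squared prediction error on held-out folds."""
    n = len(xs)
    total = 0.0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Score only on points the model did not see during fitting.
        total += sum((ys[i] - (intercept + slope * xs[i])) ** 2
                     for i in test_idx)
    return total / n
```

Calling `cv_mean_squared_error(xs, ys, k)` for several values of `k` on the same data is one simple way to see how the choice of K changes the out-of-sample error estimate.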


On the Relationship between Online Gaussian Process Regression and Kernel Least Mean Squares Algorithms

arXiv.org Machine Learning

ABSTRACT

We study the relationship between online Gaussian process (GP) regression and kernel least mean squares (KLMS) algorithms. While the latter have no capacity of storing the entire posterior distribution during online learning, we discover that their operation corresponds to the assumption of a fixed posterior covariance that follows a simple parametric model. Interestingly, several well-known KLMS algorithms correspond to specific cases of this model. The probabilistic perspective allows us to understand how each of them handles uncertainty, which could explain some of their performance differences.

Index Terms: online learning, regression, Gaussian processes, kernel least-mean squares

1. INTRODUCTION

Gaussian Process (GP) regression is a state-of-the-art Bayesian technique for nonlinear regression [1].
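For readers unfamiliar with KLMS, the basic update can be sketched in a few lines. This is the plain textbook form of the algorithm, not the paper's probabilistic model. The class name `KLMS`, the Gaussian kernel, the bandwidth and the step size are all assumptions chosen for illustration, and no sparsification of the stored centers is attempted. Note that the learner keeps only point predictions and coefficients, with no posterior covariance.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; equals 1.0 when x == y."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


class KLMS:
    """Minimal kernel least mean squares learner for scalar inputs."""

    def __init__(self, step_size=0.5, bandwidth=1.0):
        self.step_size = step_size
        self.bandwidth = bandwidth
        self.centers = []  # stored inputs
        self.coeffs = []   # one expansion coefficient per stored input

    def predict(self, x):
        """Kernel expansion: sum_i coeff_i * k(center_i, x)."""
        return sum(a * gaussian_kernel(c, x, self.bandwidth)
                   for c, a in zip(self.centers, self.coeffs))

    def update(self, x, y):
        """Process one sample online and return the prior prediction error."""
        error = y - self.predict(x)
        # Each new sample becomes a center weighted by step_size * error.
        self.centers.append(x)
        self.coeffs.append(self.step_size * error)
        return error
```

Feeding the same sample twice shows the behaviour: the error shrinks by a factor of `1 - step_size` each time, because the kernel evaluates to 1 at the stored center.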


What is the Role of the Activation Function in a Neural Network?

#artificialintelligence

Sorry if this is too trivial, but let me start at the "very beginning": linear regression. The goal of (ordinary least-squares) linear regression is to find the optimal weights that -- when linearly combined with the inputs -- result in a model that minimizes the vertical offsets between the target and explanatory variables, but let's not get distracted by model fitting, which is a different topic ;). So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function"). Next, let's consider logistic regression. Here, we put the net input z through a non-linear "activation function" -- the logistic sigmoid function $\phi(z) = 1 / (1 + e^{-z})$.
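The two steps described here, a linear net input followed by a logistic sigmoid, can be sketched as below. The function names and the optional bias term are assumptions made for the example. The sigmoid is written in a numerically stable form so that very negative inputs do not overflow `math.exp`.

```python
import math


def net_input(weights, inputs, bias=0.0):
    """Linear combination of weights and inputs: z = bias + sum(w_i * x_i)."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z)).
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_regression_output(weights, inputs, bias=0.0):
    """Linear regression would stop at the net input; logistic regression
    squashes it into the interval (0, 1)."""
    return sigmoid(net_input(weights, inputs, bias))
```

For example, `logistic_regression_output([0.5, -0.25], [2.0, 4.0])` has a net input of zero, so the output sits exactly at the midpoint of the sigmoid.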


A General Method for Robust Bayesian Modeling

arXiv.org Machine Learning

Robust Bayesian models are appealing alternatives to standard models, providing protection from data that contains outliers or other departures from the model assumptions. Historically, robust models were mostly developed on a case-by-case basis; examples include robust linear regression, robust mixture models, and bursty topic models. In this paper we develop a general approach to robust Bayesian modeling. We show how to turn an existing Bayesian model into a robust model, and then develop a generic strategy for computing with it. We use our method to study robust variants of several models, including linear regression, Poisson regression, logistic regression, and probabilistic topic models. We discuss the connections between our methods and existing approaches, especially empirical Bayes and James-Stein estimation.