 Regression


On the Computation and Applications of Large Dense Partial Correlation Networks

arXiv.org Machine Learning

Gaussian graphical models [27] are a popular approach to describing networks, and are directly related to variable prediction via linear regression [20]. The focus is often on graphical model edges described by partial correlations which are zero, identifying pairs of nodes which are conditionally independent [2]. For example, the graphical LASSO [10] imposes a sparse regularization penalty on the precision matrix estimate, seeking a network which trades off predictive accuracy for sparsity. This provides a network which is more interpretable and efficient to use; however, it is not clear that sparse solutions actually generalize better to new data than dense solutions do [28]. Meanwhile, a different research direction is based on forming edges via some simple relationship such as affinity or univariate correlation. This limited network is used as a starting point for computing sophisticated dense estimates of relatedness between nodes, providing a deeper analysis of network structure. In such research, sparsity is usually imposed on the simple network; however, the subsequent analysis is often based on methods which inherently presume Gaussian statistics and l penalties in some sense.
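The link between the precision matrix and partial correlations mentioned above can be illustrated with a minimal sketch. It assumes the standard Gaussian identity: the partial correlation between nodes i and j is the negated precision entry divided by the square root of the product of the two diagonal entries. The helper names `invert` and `partial_correlations` and the toy covariance are hypothetical and do not come from the paper, whose own dense estimation method is not shown here.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation matrix from a covariance matrix via its inverse."""
    prec = invert(cov)
    n = len(prec)
    return [
        [
            1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy covariance (an assumption for illustration): a chain X1 - X2 - X3,
# where X1 and X3 are correlated only through X2.
cov = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
pcorr = partial_correlations(cov)
print(pcorr[0][1])  # nonzero: X1 and X2 share an edge
print(pcorr[0][2])  # approximately zero: X1 and X3 are conditionally independent given X2
```

In this toy chain the univariate correlation between the first and third variables is 0.25, yet their partial correlation vanishes, which is the kind of missing edge a sparse graphical model would report.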


Machine Learning Algorithms In Layman's Terms, Part 1

#artificialintelligence

As a recent graduate of the Flatiron School's Data Science Bootcamp, I've been inundated with advice on how to ace technical interviews. A soft skill that keeps coming to the forefront is the ability to explain complex machine learning algorithms to a non-technical person. This series of posts is me sharing with the world how I would explain all the machine learning topics I come across on a regular basis...to my grandma. Some get a bit in-depth, others less so, but all I believe are useful to a non-Data Scientist. In the upcoming parts of this series, I'll be going over: To summarize, an algorithm is the mathematical life force behind a model.


Building an Employee Churn Model in Python to Develop a Strategic Retention Plan

#artificialintelligence

Employee turnover (also known as "employee churn") is a costly problem for companies. The true cost of replacing an employee can often be quite large. A study by the Center for American Progress found that companies typically pay about one-fifth of an employee's salary to replace that employee, and the cost can increase significantly if executives or the highest-paid employees are to be replaced. In other words, the cost of replacing employees remains significant for most employers. This is due to the time spent interviewing and finding a replacement, sign-on bonuses, and the loss of productivity for several months while the new employee gets accustomed to the new role. Understanding why and when employees are most likely to leave can lead to actions that improve employee retention, and possibly to planning new hiring in advance.


Snap ML: 2x Faster Machine Learning than Scikit-Learn

#artificialintelligence

Last year, we announced Snap ML, a Python-based framework designed for high-performance machine learning. Snap ML is bundled as part of the WML Community Edition or WML CE (aka PowerAI) software distribution that is available for free on Power systems. The first release of Snap ML enabled GPU acceleration of generalized linear models (GLMs) and also enabled scaling these models to multiple GPUs and multiple servers. GLMs are popular machine learning algorithms, which include logistic regression, linear regression, ridge and lasso regression, and support vector machines (SVMs). Our previous blog showed that Logistic Regression using Snap ML is 46 times faster than other methods, which rely on CPUs alone.


Machine Learning Basics: Building a Regression model in R

#artificialintelligence

The course "Machine Learning Basics: Building a Regression model in R" teaches you all the steps of creating a Linear Regression model, which is the most popular Machine Learning model, to solve business problems. Machine Learning is a field of computer science which gives the computer the ability to learn without being explicitly programmed. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. What is the Linear regression technique of Machine learning? Linear Regression is a simple machine learning model for regression problems, i.e., when the target variable is a real value.
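As a rough illustration of fitting a linear regression to a real-valued target, here is a minimal ordinary least squares sketch for a single predictor. It is written in Python rather than the R used in the course, and the function name `fit_line` and the toy data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sample covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy real data the fitted line would no longer pass through every point, and the same two formulas would give the line that minimizes the sum of squared residuals.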


XBART: Accelerated Bayesian Additive Regression Trees

arXiv.org Machine Learning

Bayesian additive regression trees (BART) (Chipman et al., 2010) is a powerful predictive model that often outperforms alternative models at out-of-sample prediction. BART is especially well-suited to settings with unstructured predictor variables and substantial sources of unmeasured variation, as is typical in the social, behavioral and health sciences. This paper develops a modified version of BART that is amenable to fast posterior estimation. We present a stochastic hill climbing algorithm that matches the remarkable predictive accuracy of previous BART implementations, but is many times faster and less memory intensive. Simulation studies show that the new method is comparable in computation time and more accurate at function estimation than both random forests and gradient boosting.


Interpretation of machine learning predictions for patient outcomes in electronic health records

arXiv.org Machine Learning

Electronic health records are an increasingly important resource for understanding the interactions between patient health, environment, and clinical decisions. In this paper we report an empirical study of predictive modeling of several patient outcomes using three state-of-the-art machine learning methods. Our primary goal is to validate the models by interpreting the importance of predictors in the final models. Central to interpretation is the use of feature importance scores, which vary depending on the underlying methodology. In order to assess feature importance, we compared univariate statistical tests, information-theoretic measures, permutation testing, and normalized coefficients from multivariate logistic regression models. In general we found poor correlation between methods in their assessment of feature importance, even when their performance is comparable and relatively good. However, permutation tests applied to random forest and gradient boosting models showed the most agreement, and the importance scores matched the clinical interpretation most frequently.
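The idea behind permutation testing for feature importance can be sketched in a few lines: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below is not the study's pipeline. The fixed `model` function stands in for a trained random forest or gradient boosting model, and the data, function names, and parameters (`n_repeats`, `seed`) are hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a model (a callable on one row) over a dataset."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature intact.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, shuffled, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical data: the target depends only on the first feature.
rows = [[float(i), float((7 * i) % 5)] for i in range(20)]
targets = [3.0 * r[0] for r in rows]


def model(row):
    """Stand-in for a trained model; it ignores the second feature."""
    return 3.0 * row[0]


scores = permutation_importance(model, rows, targets)
print(scores)  # first score is positive, second is zero
```

Because the score is computed from the model's predictions alone, the same procedure applies to any predictor, which is one reason importance scores from permutation tests can be compared across model families.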


An Efficient Augmented Lagrangian Based Method for Constrained Lasso

arXiv.org Machine Learning

Variable selection is one of the most important tasks in statistics and machine learning. To incorporate more prior information about the regression coefficients, the constrained Lasso model has been proposed in the literature. In this paper, we present an inexact augmented Lagrangian method to solve the Lasso problem with linear equality constraints. By fully exploiting second-order sparsity of the problem, we are able to greatly reduce the computational cost and obtain highly efficient implementations. Furthermore, numerical results on both synthetic data and real data show that our algorithm is superior to existing first-order methods in terms of both running time and solution accuracy.


Wavelet regression and additive models for irregularly spaced data

arXiv.org Machine Learning

We present a novel approach for nonparametric regression using wavelet basis functions. Our proposal, $\texttt{waveMesh}$, can be applied to non-equispaced data with sample size not necessarily a power of 2. We develop an efficient proximal gradient descent algorithm for computing the estimator and establish adaptive minimax convergence rates. The main appeal of our approach is that it naturally extends to additive and sparse additive models for a potentially large number of covariates. We prove minimax optimal convergence rates under a weak compatibility condition for sparse additive models. The compatibility condition holds when we have a small number of covariates. Additionally, we establish convergence rates for when the condition is not met. We complement our theoretical results with empirical studies comparing $\texttt{waveMesh}$ to existing methods.


BigQuery for Data Science

#artificialintelligence

One of the perks of using Google Cloud Platform (GCP) is having BigQuery, Google's cloud-hosted data warehouse solution, at your disposal. BigQuery gives GCP users access to the key features of Dremel, Google's very own internal data warehouse solution. Under the hood, Dremel stores data in columnar format and uses a tree architecture to parallelise queries across thousands of machines, with each query scanning the entire table. So, what is so great about that? With BigQuery you can run SQL queries on a table with billions of rows and get the results in seconds!