Goto

Collaborating Authors

 Regression


Transformation Forests

arXiv.org Machine Learning

Regression models for supervised learning problems with a continuous target are commonly understood as models for the conditional mean of the target given predictors. This notion is simple and therefore appealing for interpretation and visualisation. Information about the whole underlying conditional distribution is, however, not available from these models. A more general understanding of regression models as models for conditional distributions allows much broader inference from such models, for example the computation of prediction intervals. Several random forest-type algorithms aim at estimating conditional distributions, most prominently quantile regression forests (Meinshausen, 2006, JMLR). We propose a novel approach based on a parametric family of distributions characterised by their transformation function. A dedicated novel "transformation tree" algorithm able to detect distributional changes is developed. Based on these transformation trees, we introduce "transformation forests" as an adaptive local likelihood estimator of conditional distribution functions. The resulting models are fully parametric yet very general and allow broad inference procedures, such as the model-based bootstrap, to be applied in a straightforward way.


Intuition behind Bias-Variance trade-off, Lasso and Ridge Regression

@machinelearnbot

We can see that Ridge and Lasso is performing far better than Linear Regression when the correlation exists in the dataset. We can tune our penalty parameter further and try to find best value of RMSE. Here, 0.01 is the best value I got for lambda. So, we can use these regression methods when the variables are highly correlated. Hope this article was useful in understanding Bias-Variance trade-off, Lasso and Ridge Regression. Please feel free to comment, give feedback and share this article if you found it useful. You can find the original article on my blog here.


Choosing the Correct Type of Regression Analysis

@machinelearnbot

Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data. I'll provide an overview along with information to help you choose.


A Complete Guide to Build Better Predictive Models using Segmentation

@machinelearnbot

We use linear or logistic regression technique for developing accurate models for predicting an outcome of interest. Often, we create separate models for separate segments. To judge their effectiveness, we even make use of segmentation methods such as CHAID or CRT.


Machine Learning Algorithms: Which One to Choose for Your Problem

#artificialintelligence

Supervised learning is the task of inferring a function from labeled training data. By fitting to the labeled training set, we want to find the most optimal model parameters to predict unknown labels on other objects (test set). If the label is a real number, we call the task regression. If the label is from the limited number of values, where these values are unordered, then it's classification. In unsupervised learning we have less information about objects, in particular, the train set is unlabeled.


You have created your first Linear Regression Model. Have you validated the assumptions?

@machinelearnbot

With the dawn of the age of Data Science, there is an increased interest in learning and applying algorithms, not just by business analysts or data scientists, but by several other professionals whose core job may not be crunching data or building models. Good sign, indeed, if one understands the when, why and how of applying these fantastic techniques. If your scatterplot shows curvilinear relationship, keep in mind that higher order polynomials (2 or above) may do a better job at modelling the data. Compare models, statistics and decide for yourself which model best explains your data. For validity of a LR model, the VIF (Variance Inflationary Factor) should not be too high. How high is too high?


Adversarial Perturbation Intensity Achieving Chosen Intra-Technique Transferability Level for Logistic Regression

arXiv.org Machine Learning

Machine Learning models have been shown to be vulnerable to adversarial examples, ie. the manipulation of data by a attacker to defeat a defender's classifier at test time. We present a novel probabilistic definition of adversarial examples in perfect or limited knowledge setting using prior probability distributions on the defender's classifier. Using the asymptotic properties of the logistic regression, we derive a closed-form expression of the intensity of any adversarial perturbation, in order to achieve a given expected misclassification rate. This technique is relevant in a threat model of known model specifications and unknown training data. To our knowledge, this is the first method that allows an attacker to directly choose the probability of attack success. We evaluate our approach on two real-world datasets.


Program Evaluation and Causal Inference with High-Dimensional Data

arXiv.org Machine Learning

In this paper, we provide efficient estimators and honest confidence bands for a variety of treatment effects including local average (LATE) and local quantile treatment effects (LQTE) in data-rich environments. We can handle very many control variables, endogenous receipt of treatment, heterogeneous treatment effects, and function-valued outcomes. Our framework covers the special case of exogenous receipt of treatment, either conditional on controls or unconditionally as in randomized control trials. In the latter case, our approach produces efficient estimators and honest bands for (functional) average treatment effects (ATE) and quantile treatment effects (QTE). To make informative inference possible, we assume that key reduced form predictive relationships are approximately sparse. This assumption allows the use of regularization and selection methods to estimate those relations, and we provide methods for post-regularization and post-selection inference that are uniformly valid (honest) across a wide-range of models. We show that a key ingredient enabling honest inference is the use of orthogonal or doubly robust moment conditions in estimating certain reduced form functional parameters. We illustrate the use of the proposed methods with an application to estimating the effect of 401(k) eligibility and participation on accumulated assets.


Python Overtaking R?

@machinelearnbot

I just read two articles that claim that Python is overtaking R for data science and machine learning. From user comments, I learned that R is still strong in certain tasks. I will survey what these tasks are. The first article by Vincent Granville from DSC uses proxy metrics (as opposed to asking the users). He uses statistics from Google Trends, Indeed job search terms, and Analytic Talent (DSC job database) to conclude that Python has overtaken R. One is led to ask if one group of users (say Python's) is a more active googler.


Installation Quickstart for Azure Machine Learning services

#artificialintelligence

Azure Machine Learning services (preview) is an integrated, end-to-end data science and advanced analytics solution. It helps professional data scientists to prepare data, develop experiments, and deploy models at cloud scale. This Quickstart shows you how to create experimentation and model management accounts in Azure Machine Learning Preview. It also shows you how to install the Azure Machine Learning Workbench desktop application and CLI tools. Next, you take a quick tour of Azure Machine Learning Preview features by using the Iris flower dataset to build a model that predicts the type of iris based on some of its physical characteristics.