Regression
Post Selection Inference with Kernels
Yamada, Makoto, Umezu, Yuta, Fukumizu, Kenji, Takeuchi, Ichiro
We propose a novel kernel-based post selection inference (PSI) algorithm that can handle not only non-linearity in data but also structured output such as multi-dimensional and multi-label outputs. Specifically, we develop a PSI algorithm for independence measures, and propose the Hilbert-Schmidt Independence Criterion (HSIC) based PSI algorithm (hsicInf). The novelty of the proposed algorithm is that it can handle non-linearity and/or structured data through kernels. Namely, the proposed algorithm can be used for a wider range of applications, including nonlinear multi-class classification and multivariate regression, which existing PSI algorithms cannot handle. Through synthetic experiments, we show that the proposed approach can find a set of statistically significant features for both regression and classification problems. Moreover, we apply the hsicInf algorithm to real-world data and show that hsicInf can successfully identify important features.
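For readers unfamiliar with HSIC, the sketch below computes the standard biased empirical estimate, trace(KHLH) / (n - 1)^2, with Gaussian kernels in NumPy. It illustrates only the independence measure that the abstract builds on, not the post selection inference procedure itself; the function names, the kernel width, and the toy data are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel; sigma is an assumed bandwidth."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances, clipped at zero against rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy data (made up): y_dep depends nonlinearly on x, y_ind does not.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x ** 2
y_ind = rng.normal(size=200)

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
```

On data like this, the estimate for the dependent pair should come out clearly larger than for the independent pair, even though the relationship is nonlinear and the linear correlation between x and x squared is close to zero.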
Multiple Linear Regression in Machine Learning
A couple of weeks ago I wrote an article on simple linear regression, which I recommend reading before this one. Machine learning is a very interesting topic, and I have been studying it in my free time. I hope this article sparks your interest in the subject or helps fuel it. In simple linear regression there is a one-to-one relationship between the input variable and the output variable. In multiple linear regression, as the name implies, there is a many-to-one relationship: instead of using just one input variable, you use several.
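As a minimal sketch of the many-to-one idea, the following fits a model with two input variables and one output by ordinary least squares using NumPy. The data and variable names are made up for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical toy data: two input variables per row, one output variable.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Outputs generated exactly as y = 1 + 2*x1 + 3*x2, so the fit is recoverable.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef_x1, coef_x2 = beta

# Predict the output for a new observation with x1 = 6, x2 = 2.
prediction = intercept + coef_x1 * 6.0 + coef_x2 * 2.0
```

Because the toy outputs contain no noise, the recovered coefficients match the generating ones up to floating-point error; with real data they would only be estimates.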
Sparse principal component regression for generalized linear models
Kawano, Shuichi, Fujisawa, Hironori, Takada, Toyoyuki, Shiroishi, Toshihiko
Principal component regression (PCR) is a widely used two-stage procedure: principal component analysis (PCA), followed by regression in which the selected principal components are regarded as new explanatory variables in the model. Note that PCA is based only on the explanatory variables, so the principal components are not selected using the information on the response variable. In this paper, we propose a one-stage procedure for PCR in the framework of generalized linear models. The basic loss function is based on a combination of the regression loss and PCA loss. An estimate of the regression parameter is obtained as the minimizer of the basic loss function with a sparse penalty. We call the proposed method sparse principal component regression for generalized linear models (SPCR-glm). Taking the two loss functions into consideration simultaneously, SPCR-glm enables us to obtain sparse principal component loadings that are related to a response variable. A combination of loss functions may cause a parameter identification problem, but this potential problem is avoided by virtue of the sparse penalty. Thus, the sparse penalty plays two roles in this method. The parameter estimation procedure is proposed using various update algorithms with the coordinate descent algorithm. We apply SPCR-glm to two real datasets, doctor visits data and mouse consomic strain data. SPCR-glm provides more easily interpretable principal component (PC) scores and clearer classification on PC plots than the usual PCA.
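To make the baseline concrete, here is a minimal NumPy sketch of the classical two-stage PCR that the abstract describes as its starting point: PCA on the explanatory variables alone, then least squares on the selected PC scores. It is not SPCR-glm, which replaces these two stages with a single penalized objective; the function names and the random toy data are assumptions for illustration.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Two-stage PCR: PCA on X only, then least squares on the PC scores."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Stage 1: PCA via SVD of the centered explanatory variables.
    # The response y plays no part in choosing the components.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T   # shape (p, k)
    scores = Xc @ loadings           # shape (n, k)
    # Stage 2: regress the centered response on the selected PC scores.
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, loadings, gamma

def pcr_predict(model, X_new):
    x_mean, y_mean, loadings, gamma = model
    return y_mean + (X_new - x_mean) @ loadings @ gamma

# Toy data (made up): four explanatory variables, noise-free linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 3.0

full_model = pcr_fit(X, y, n_components=4)     # keeps every component
reduced_model = pcr_fit(X, y, n_components=2)  # keeps the top two only
```

With all four components retained, PCR coincides with ordinary least squares and reproduces the noise-free response. With fewer components, the fit depends on whether the high-variance directions of X happen to carry the signal, which is the limitation the abstract points to.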
Machine Learning with InsightEdge: Part II - DZone Big Data
Now that we have training and test datasets sampled, initially preprocessed, and available in the data grid, we can close Web Notebook and start experimenting with different techniques and algorithms by submitting Spark applications. For our first baseline approach, let's take a single feature, device_conn_type, and the logistic regression algorithm. We will explain a little more about what happens here. First, we load the training dataset from the data grid, which we prepared and saved earlier with Web Notebook. Then we use StringIndexer and OneHotEncoder to map a column of categories to a column of binary vectors. For example, with 4 categories of device_conn_type, an input value of the second category would map to an output vector of [0.0, 1.0, 0.0, 0.0].
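The category-to-vector mapping can be sketched in plain Python as follows. This is not Spark code and not the article's application: the helper names and the category values are hypothetical, and the sketch keeps one vector entry per category.

```python
def fit_string_index(values):
    """Assign each distinct category an integer index, most frequent first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}

def one_hot(index, n_categories):
    """Return a binary vector with a single 1.0 at the given index."""
    vec = [0.0] * n_categories
    vec[index] = 1.0
    return vec

# Made-up connection types standing in for device_conn_type values.
conn_types = ["wifi", "cell", "wifi", "unknown", "wifi", "cell", "ethernet"]

index = fit_string_index(conn_types)
encoded = [one_hot(index[v], len(index)) for v in conn_types]
```

Here "cell" is the second most frequent value, so it receives index 1 and is encoded with a 1.0 in the second position. Note that Spark's own OneHotEncoder drops the last category by default (its dropLast parameter), so its output vectors are one entry shorter than in this sketch.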
Learning from Disaster – The Random Forest Approach.
Having tried logistic regression the first time around, I moved on to decision trees and KNN. Unfortunately, those models performed horribly and had to be scrapped. Random Forest seemed to be the buzzword around the Kaggle forums, so I obviously had to try it out next. I took a couple of days to read up on it and worked through a few examples on my own before taking another stab at the Titanic dataset. The 'caret' package is a beauty.
Logistic model tree - Wikipedia, the free encyclopedia
In computer science, a logistic model tree (LMT) is a classification model with an associated supervised training algorithm that combines logistic regression (LR) and decision tree learning.[1][2] Logistic model trees are based on the earlier idea of a model tree: a decision tree that has linear regression models at its leaves to provide a piecewise linear regression model (where ordinary decision trees with constants at their leaves would produce a piecewise constant model).[1] In the logistic variant, the LogitBoost algorithm is used to produce an LR model at every node in the tree; the node is then split using the C4.5 criterion. Each LogitBoost invocation is warm-started from its results in the parent node. Finally, the tree is pruned.[3]
Proper train and test sets when using ML on a dataset? • /r/MachineLearning
I just completed a take-home assessment as part of the interview process for a company. I was told I didn't pass because my answer lacked proper training and test sets. The data set consisted of a mix of categorical and numerical predictors, with the dependent variable being a numerical variable. I then removed all rows with NA values and generated boxplots for each predictor. For one variable, I replaced all of its outliers with the median. For some other variables that indicated percentage values, I did not remove the outliers because they did not seem like obvious outliers (for example, the boxplot showed that values greater than .1 were outliers, but all of those outliers still ranged from 0 to 1, so I didn't think they were typos). I then ran a Lasso linear regression model.
Introduction to Logistic Regression in R
In my previous blog post I explained linear regression. In today's post I will explain logistic regression. Consider a scenario where we need to predict a patient's medical condition (HBP), HAVE HIGH BP or NO HIGH BP, based on some observed symptoms: Age, weight, Issmoking, Systolic value, Diastolic value, RACE, etc. In this scenario we have to build a model that takes the above-mentioned symptoms as input values and HBP as the response variable. Note that the response variable (HBP) is a value among a fixed set of classes, HAVE HIGH BP or NO HIGH BP.
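The way a logistic regression model turns input values into one of two classes can be sketched as below. The post itself works in R; this Python sketch uses made-up coefficients that were not fitted to any data, a hypothetical function name, and only three of the listed inputs, so it shows the mechanics rather than a usable model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_hbp(age, weight, is_smoking, coef, threshold=0.5):
    """Return (probability of high BP, class label) from a linear score."""
    z = (coef["intercept"]
         + coef["age"] * age
         + coef["weight"] * weight
         + coef["is_smoking"] * is_smoking)
    p = sigmoid(z)
    label = "HAVE HIGH BP" if p >= threshold else "NO HIGH BP"
    return p, label

# Made-up coefficients for illustration only (not estimated from data).
coef = {"intercept": -8.0, "age": 0.05, "weight": 0.06, "is_smoking": 0.8}

p_a, label_a = predict_hbp(age=60, weight=90, is_smoking=1, coef=coef)
p_b, label_b = predict_hbp(age=25, weight=60, is_smoking=0, coef=coef)
```

The linear score can be any real number, but the logistic function squeezes it into a probability, and the threshold then assigns one of the two fixed classes. In practice the coefficients would be estimated from labelled patient data rather than chosen by hand.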