Regression
Logistic Regression categorical data issues • /r/MachineLearning
I have created a model in R using data with a lot of categorical variables, and it works well enough (70% classification rate). I need to port the code to Python for production; however, when I port it the results get considerably worse (40% classification rate). I think it may have to do with how I encode the categorical data. Any ideas?
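One thing worth checking, offered as a guess rather than a diagnosis of this poster's problem: R's model functions expand an unordered factor into 0/1 indicator columns with the first level (alphabetical by default) as the reference, whereas a Python port that feeds integer category codes into a linear model imposes an ordering the data does not have. A minimal sketch of the R-style treatment coding in plain Python follows; the function name `treatment_code` and the toy `colours` data are assumptions made for illustration.

```python
def treatment_code(values, levels=None):
    """Encode a categorical column as 0/1 indicator columns.

    The first level is the reference and gets no column of its own,
    which mirrors treatment (dummy) coding. Levels default to the
    sorted distinct values.
    """
    if levels is None:
        levels = sorted(set(values))
    reference, *others = levels
    rows = []
    for value in values:
        if value not in levels:
            # Fail loudly instead of silently encoding an unseen level as all zeros.
            raise ValueError(f"unseen level: {value!r}")
        rows.append([1 if value == level else 0 for level in others])
    return others, rows


# Hypothetical toy column, not data from the original post.
colours = ["red", "green", "blue", "green"]
columns, encoded = treatment_code(colours)
print(columns)   # indicator column names; "blue" is the reference level
print(encoded)
```

Passing the same `levels` list at training time and at prediction time keeps the column order identical between the two, which is another place where ports tend to drift.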
Logistic Regression Vs Decision Trees Vs SVM: Part I - Edvancer Eduventures
Classification is one of the major problems that we solve while working on standard business problems across industries. In this article we'll discuss three major techniques among the many used for it: Logistic Regression, Decision Trees and Support Vector Machines [SVM]. All three are used for classification [SVM and Decision Trees are also used for regression, but we are not discussing that today!]. Time and again I have seen people ask which one to choose for their particular problem. The classic, most correct, but least satisfying response to that question is "it depends!".
R: Simple Linear Regression
Linear Regression is a very popular prediction method and most likely the first predictive algorithm most people learn. To put it simply, in linear regression you try to place a line of best fit through a data set and then use that line to predict new data points. Now our data file contains a listing of the Years a person has worked for company A and their Salary. With a two-variable data set, it is often quickest just to graph the data to check for a possible linear relationship. Looking at the plot, there definitely appears to be a linear relationship.
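The line of best fit described above can be sketched with ordinary least squares. The article itself works in R; this is a plain-Python illustration, and the `years` and `salary` values are made up for the example (they are not the article's data file).

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, exactly linear toy data.
years = [1, 2, 3, 4, 5]
salary = [40000, 45000, 50000, 55000, 60000]

slope, intercept = fit_line(years, salary)
predicted_year_6 = slope * 6 + intercept  # use the fitted line on a new point
print(slope, intercept, predicted_year_6)
```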
Learning Representations for Counterfactual Inference
Johansson, Fredrik D., Shalit, Uri, Sontag, David
Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.
What to do with an industrial/manufacturing data set? • /r/MachineLearning
I am a chemical engineer who is learning programming. Until now I've mostly been working on building interfaces using web programming. Recently I got access to all of my company's lab/quality, inventory, and PLC/manufacturing data. I am interested in digging into this data. I am (sort of) versed in Python, and have done a few tutorials with sklearn and the linear regression algorithm.
Bootstrap and cross-validation for evaluating modelling strategies
I've been re-reading Frank Harrell's Regression Modelling Strategies, a must read for anyone who ever fits a regression model, although be prepared - depending on your background, you might get 30 pages in and suddenly become convinced you've been doing nearly everything wrong before, which can be disturbing. I wanted to evaluate three simple modelling strategies in dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand's 2013 census, I'm looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like "mean number of bedrooms", "proportion of individuals with no religion" and "proportion of individuals who are smokers". None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.
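A competition between modelling strategies can be sketched with k-fold cross-validation: each strategy is refitted on the training folds and scored only on held-out rows. The sketch below is an illustration in plain Python, not the post's own code; the two strategies (predict the mean, fit a straight line) and the noiseless toy data are assumptions chosen to keep it short, and it covers cross-validation only, not the bootstrap.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(x, y, fit, predict, k=5, seed=0):
    """Refit on k-1 folds, score on the held-out fold, pool the squared errors."""
    squared_errors = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        squared_errors.extend((predict(model, x[i]) - y[i]) ** 2 for i in held_out)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Strategy 1: ignore x and always predict the training mean.
def fit_mean(x, y):
    return sum(y) / len(y)


def predict_mean(model, xi):
    return model


# Strategy 2: least-squares straight line.
def fit_line(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, xi):
    slope, intercept = model
    return slope * xi + intercept


# Hypothetical noiseless data, so the line should win clearly.
x = list(range(20))
y = [3 * xi + 2 for xi in x]

rmse_mean = cross_validated_rmse(x, y, fit_mean, predict_mean)
rmse_line = cross_validated_rmse(x, y, fit_line, predict_line)
print(rmse_mean, rmse_line)
```

The important design point is that any data-driven step of a strategy, such as variable selection, has to happen inside `fit`, so that it is repeated on every training split rather than once on the full data.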
Question on Regression
To begin with, you need to give us more information about what kind of data you have and what your objectives and research questions are, so that we can give relevant help rather than speculate. However, a general principle that I have often used successfully is to run a univariate regression on the combined effect of each categorical variable and then follow up with multiple regression. If the combined effect of a categorical variable is not significant, there is no need to declare the classes for that variable in the multiple regression model; or, if some of the classes are similar in nature, you could collapse them into one class and then test their combined effect again by repeating the process above. You would do this for all the categorical variables in your data set. Yes, you can use linear regression to achieve this, but with 100 classes for one categorical variable I am afraid you will be spending so many degrees of freedom that it may seriously affect the optimality of your fitted model and its predictive power. I would therefore suggest collapsing the classes into fewer, if that is possible, bearing in mind your research questions and objectives.
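The collapsing step suggested above might look like the following in Python. This is only a sketch: the `region` values and the `merge_map` grouping are hypothetical, and in practice the grouping should come from subject knowledge about which classes are similar in nature, as the answer says.

```python
def collapse_levels(values, merge_map):
    """Replace each raw class by its merged class; unmapped classes are kept as-is."""
    return [merge_map.get(value, value) for value in values]


# Hypothetical categorical variable with fine-grained classes.
region = ["north_a", "south_b", "north_b", "east", "south_a"]

# Hypothetical grouping of classes judged similar in nature.
merge_map = {
    "north_a": "north",
    "north_b": "north",
    "south_a": "south",
    "south_b": "south",
}

collapsed = collapse_levels(region, merge_map)
levels_before = len(set(region))
levels_after = len(set(collapsed))
print(collapsed, levels_before, levels_after)
```

With treatment coding, a variable with m classes costs m - 1 degrees of freedom, so going from 100 classes to 10 reduces that cost from 99 to 9.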
rasbt/python-machine-learning-book
TensorFlow is more of a low-level library; basically, we can think of TensorFlow as the Lego bricks (similar to NumPy and SciPy) that we can use to implement machine learning algorithms, whereas scikit-learn comes with off-the-shelf algorithms, e.g., algorithms for classification such as SVMs, Random Forests, Logistic Regression, and many, many more. TensorFlow really shines if we want to implement deep learning algorithms, since it allows us to take advantage of GPUs for more efficient training. To get a better idea of how these two libraries differ, let's fit a softmax regression model on the Iris dataset via scikit-learn: Now, if we want to fit a softmax regression model via TensorFlow, however, we have to "build" the algorithm first. But it sounds more complicated than it really is. TensorFlow comes with many "convenience" functions and utilities; for example, if we want to use a gradient descent optimization approach, the core of our implementation could look like this:
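The original listings are not reproduced in this excerpt. As a stand-in, here is a minimal sketch of what "building" softmax regression with gradient descent involves, written in plain Python rather than TensorFlow and trained on a small made-up dataset rather than Iris. It is not the book's code; every name in it (`softmax`, `predict_proba`, `train`, the toy `X` and `y`) is an assumption made for illustration.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def mean_cross_entropy(weights, biases, X, y):
    return -sum(math.log(predict_proba(weights, biases, x)[label])
                for x, label in zip(X, y)) / len(X)


def train(X, y, n_classes, learning_rate=0.05, epochs=2000):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                # Gradient of the loss w.r.t. logit k: probability minus one-hot target.
                error = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += error
                for j in range(n_features):
                    grad_w[k][j] += error * x[j]
        for k in range(n_classes):
            biases[k] -= learning_rate * grad_b[k] / len(X)
            for j in range(n_features):
                weights[k][j] -= learning_rate * grad_w[k][j] / len(X)
    return weights, biases


# Hypothetical toy data: three well-separated classes in two dimensions.
X = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (1, 5), (0, 6)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

initial_loss = mean_cross_entropy([[0.0, 0.0]] * 3, [0.0] * 3, X, y)
weights, biases = train(X, y, n_classes=3)
final_loss = mean_cross_entropy(weights, biases, X, y)
print(initial_loss, final_loss)
```

A library such as TensorFlow replaces the hand-written gradient loops with automatic differentiation and a ready-made optimizer, which is the kind of "convenience" the text refers to; scikit-learn hides the whole procedure behind a single estimator.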