Visualizing the relationships between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package handles this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. For all the code in this post in one file, click here. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationships for a group of variables. Let's demonstrate this on a small segment of the vehicles dataset from the fueleconomy package. First, let's see how GGally::ggpairs() visualizes relationships between quantitative variables; the visualization changes a little when we have a mix of quantitative and categorical variables.
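A minimal sketch of that demonstration, assuming the GGally, dplyr, and fueleconomy packages are installed (the choice of make and columns here is a hypothetical subset, not the post's exact segment):

```r
library(dplyr)
library(GGally)
library(fueleconomy)

# Take a small segment of the vehicles dataset: a few quantitative
# columns (displ, cyl, hwy, cty) plus one categorical column (drive).
df <- vehicles %>%
  filter(make == "Honda") %>%
  select(displ, cyl, hwy, cty, drive)

# ggpairs() draws scatterplots and correlations for quantitative pairs,
# and switches to boxplots/histograms where a categorical variable
# is involved.
ggpairs(df)
```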
I need input on the pros and cons of building a logistic regression model using dummy variables instead of the weight-of-evidence (WoE) approach for categorical variables. I know that one of the things to look at is the number of unique levels within a categorical variable. But, making reasonable assumptions, I would like to know in a generic sense whether there are other pros and cons of the dummy-variable approach versus the WoE approach.
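For concreteness, here is a sketch contrasting the two encodings on a toy binary-target example (the data frame, column names, and values below are made up for illustration):

```r
# Toy data: categorical predictor x, binary target y.
df <- data.frame(
  x = c("a", "a", "b", "b", "b", "c", "c", "c"),
  y = c(1, 0, 1, 1, 0, 0, 0, 1)
)

# Dummy-variable approach: one indicator column per level
# (minus one reference level).
dummies <- model.matrix(~ x, data = df)

# Weight-of-evidence approach: replace each level with
# log(share of events / share of non-events) for that level.
tab  <- table(df$x, df$y)
good <- tab[, "1"] / sum(tab[, "1"])  # share of events per level
bad  <- tab[, "0"] / sum(tab[, "0"])  # share of non-events per level
woe  <- log(good / bad)
df$x_woe <- woe[df$x]
```

Note the trade-off this makes visible: dummies keep one coefficient per level (many columns for high-cardinality variables), while WoE collapses each variable to a single numeric column but bakes the target into the encoding, so it must be computed on training data only.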
I just completed a take-home assessment as part of the interview process for a company. I was told I didn't pass because my answer lacked proper training and test sets. The data set consisted of a mix of categorical and numerical predictors, with the dependent variable being numerical. I removed all rows with NA values and generated boxplots for each predictor. For one variable, I replaced all of its outliers with the median. For some other variables that held percentage values, I did not remove the outliers because they did not seem like obvious errors (for example, the boxplot flagged values greater than 0.1 as outliers, but all of those values still ranged from 0 to 1, so I didn't think they were typos). I then ran a Lasso linear regression model.
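A sketch of the missing step: hold out a test set before fitting, and fit the Lasso on the training rows only. This assumes a data frame `df` with a numeric column `target`; the column names and split ratio are hypothetical:

```r
library(glmnet)

set.seed(42)
n <- nrow(df)
train_idx <- sample(n, size = floor(0.8 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit the Lasso (alpha = 1) on the training set only, choosing
# lambda by cross-validation within the training data.
x_train <- model.matrix(target ~ ., data = train)[, -1]
fit <- cv.glmnet(x_train, train$target, alpha = 1)

# Evaluate on the untouched test set.
x_test <- model.matrix(target ~ ., data = test)[, -1]
pred <- predict(fit, newx = x_test, s = "lambda.min")
```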
The 'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using, it's probably going to learn the other two, better-balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an 'age' variable for the waterpoints, as that seems highly relevant. The 'population' variable also has a highly right-skewed distribution, so we're going to transform that as well. One of the most important points we learned the week before, and something that will stay with me, is the idea of coming up with a baseline model as fast as one can.
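The two transformations above can be sketched like this, assuming the waterpoint data frame `df` has `date_recorded`, `construction_year`, and `population` columns (as in the usual waterpoint dataset; verify the names against your copy):

```r
# 'age' of the waterpoint: year it was recorded minus the year
# it was constructed.
df$age <- as.integer(format(as.Date(df$date_recorded), "%Y")) -
  df$construction_year

# Tame the right skew in 'population' with a log transform;
# log1p handles waterpoints with a recorded population of zero.
df$population_log <- log1p(df$population)
```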
In machine learning, data are king. The algorithms and models used to make predictions from the data are important, and very interesting, but ML is still subject to the principle of garbage in, garbage out. With that in mind, let's look at one particular kind of input data: categorical variables. Categorical variables (wiki) are those that take one of a fixed set of possible values, rather than values on a continuous scale. Each value assigns the measurement to one of those finite groups, or categories.
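A minimal illustration of the definition: in R, a categorical variable is a factor, whose values are restricted to a fixed set of levels (the variable name and levels here are made up):

```r
color <- factor(c("red", "green", "red", "blue"),
                levels = c("red", "green", "blue"))

levels(color)  # the finite set of categories
table(color)   # how many measurements fall in each category
```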