Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. For all the code in this post in one file, click here. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let's demonstrate this on a small segment of the vehicles dataset from the fueleconomy package: Let's see how GGally::ggpairs() visualizes relationships between quantitative variables: The visualization changes a little when we have a mix of quantitative and categorical variables.
Simpson's paradox is the phenomenon that a trend of an association in the whole population reverses within the subpopulations defined by a categorical variable. Detecting Simpson's paradox indicates surprising and interesting patterns of the data set for the user. It is generally discussed in terms of binary variables, but studies for the exploration of it for continuous variables are relatively rare. This paper describes a method to discover Simpson's paradox for the trend of the pair of continuous variables. Correlation coefficient is used to indicate the association between a pair of continuous variables.
I just completed a take home assessment as part of the interview process for a company. I was told I didn't pass because my answer lacked proper training and test sets The data set consisted of a mix of categorical and numerical predictors, with the dependent variable being a numerical variable. I then removed all rows with NA values and generated boxplots for each predictor. For one variable, I replaced all of its outliers with the median. For some other variables that indicated percentage values, I did not remove the outliers because they did not seem like obvious outliers (for example, the boxplot showed that values greater than .1 were outliers, but all of those outliers still ranged from 0 to 1 so I didn't think they were typos) I then ran a Lasso linear regression model.
The'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using it's probably going to learn the other two balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an'age' variable for the waterpoints as that seems highly relevant. The'population' variable also has a highly right-skewed distribution so we're going to change that as well: One of the most important points we learned from the week before and something that will stay with me is the idea of coming up with a baseline model as fast as one can.
I need inputs on the pros and cons of building a log-reg model using dummy variables instead of the Weight of evidence approach for categorical variables. I know one of the things that needs to be looked at is the number of unique levels within a categorical variable. But, making reasonable assumptions, in a generic sense I would like to know if there are any pros and few other cons of using the Dummy variable approach vs the WoE approach.