Visualizing the relationships among multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package handles this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationships for a group of variables. Let's demonstrate this on a small segment of the vehicles dataset from the fueleconomy package. Let's see how GGally::ggpairs() visualizes relationships between quantitative variables. The visualization changes a little when we have a mix of quantitative and categorical variables.
Simpson's paradox is the phenomenon in which the trend of an association in the whole population reverses within the subpopulations defined by a categorical variable. Detecting Simpson's paradox reveals surprising and interesting patterns in a data set. It is generally discussed in terms of binary variables; studies that explore it for continuous variables are relatively rare. This paper describes a method to discover Simpson's paradox in the trend between a pair of continuous variables. The correlation coefficient is used to indicate the association between a pair of continuous variables.
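As a rough illustration of the idea (not the paper's actual method, which the abstract does not spell out), the sketch below compares the Pearson correlation on the whole sample with the correlation inside each subgroup and flags a sign reversal. The function names `pearson` and `simpson_reversal` and the toy data are hypothetical, made up for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simpson_reversal(xs, ys, groups):
    """Return (overall r, per-group r, flag).

    The flag is True when every subgroup correlation has the opposite
    sign to the overall correlation. This is one simple criterion,
    assumed here for illustration only.
    """
    overall = pearson(xs, ys)
    by_group = {}
    for g in sorted(set(groups)):
        gx = [x for x, k in zip(xs, groups) if k == g]
        gy = [y for y, k in zip(ys, groups) if k == g]
        by_group[g] = pearson(gx, gy)
    reversed_everywhere = all(r * overall < 0 for r in by_group.values())
    return overall, by_group, reversed_everywhere

# Toy data, invented for illustration: y falls with x inside each group,
# but group "b" sits higher on both axes, so the pooled trend is upward.
xs = [1, 2, 3, 6, 7, 8]
ys = [3, 2, 1, 8, 7, 6]
groups = ["a", "a", "a", "b", "b", "b"]

overall, by_group, flag = simpson_reversal(xs, ys, groups)
print(overall, by_group, flag)
```

On this toy sample the pooled correlation is positive while both subgroup correlations are negative, which is the reversal the abstract describes. A real analysis would also need to check whether the correlations are statistically meaningful before calling it a paradox.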
Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are integer encoding and one-hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods. In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras. A categorical variable is a variable whose values are labels.
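To make the two basic encodings concrete, here is a minimal sketch in plain Python, with no Keras dependency. The helper names `fit_integer_encoding`, `integer_encode`, and `one_hot_encode` and the `colors` sample are hypothetical, chosen only for this illustration.

```python
def fit_integer_encoding(values):
    """Map each distinct label to an integer, assigned in sorted label order."""
    return {label: i for i, label in enumerate(sorted(set(values)))}

def integer_encode(values, mapping):
    """Replace each label with its integer code."""
    return [mapping[v] for v in values]

def one_hot_encode(values, mapping):
    """Replace each label with a 0/1 vector that has a single 1 at its code."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        row[mapping[v]] = 1
        rows.append(row)
    return rows

# Made-up sample column for illustration.
colors = ["red", "green", "blue", "green"]

mapping = fit_integer_encoding(colors)     # {'blue': 0, 'green': 1, 'red': 2}
codes = integer_encode(colors, mapping)    # [2, 1, 0, 1]
onehot = one_hot_encode(colors, mapping)   # one row of length 3 per value

print(mapping)
print(codes)
print(onehot)
```

Integer encoding implies an ordering between labels that may not exist, while one-hot encoding avoids that at the cost of one column per label. A learned embedding typically starts from the integer codes and lets the model learn a dense vector for each one.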
[Discussion] InfoGAN training: Continuous variable is capturing digit number (0-9). I was training InfoGAN on MNIST with one continuous and one categorical variable. Interestingly, the continuous variable is capturing the digit's category (0-9) and the categorical variable is capturing some other information.
I need input on the pros and cons of building a logistic regression model using dummy variables instead of the weight of evidence (WoE) approach for categorical variables. I know that one thing to look at is the number of unique levels within a categorical variable. But, making reasonable assumptions and speaking generically, I would like to know whether there are any pros, and any other cons, of using the dummy variable approach versus the WoE approach.
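For readers comparing the two encodings, the sketch below shows what each one produces for a single categorical column. It assumes the convention WoE = ln(share of non-events / share of events); some references use the inverse ratio, which only flips the sign. The function names and the toy data are hypothetical, and no smoothing is applied, so a level with zero events or zero non-events would raise an error.

```python
from math import log

def weight_of_evidence(categories, targets):
    """WoE per level, assuming WoE = ln(% of non-events / % of events).

    targets are 0/1, with 1 marking an event. No smoothing is applied.
    """
    total_events = sum(targets)
    total_non_events = len(targets) - total_events
    woe = {}
    for level in sorted(set(categories)):
        level_targets = [t for c, t in zip(categories, targets) if c == level]
        events = sum(level_targets)
        non_events = len(level_targets) - events
        woe[level] = log((non_events / total_non_events) / (events / total_events))
    return woe

def dummy_encode(categories):
    """k-1 dummy columns: the first level in sorted order is the reference."""
    kept_levels = sorted(set(categories))[1:]
    rows = [[1 if c == level else 0 for level in kept_levels] for c in categories]
    return kept_levels, rows

# Made-up sample for illustration: level "A" has 1 event in 4, "B" has 3 in 4.
categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
targets = [1, 0, 0, 0, 1, 1, 1, 0]

woe = weight_of_evidence(categories, targets)
woe_column = [woe[c] for c in categories]       # a single numeric column
kept_levels, dummy_rows = dummy_encode(categories)  # k-1 indicator columns

print(woe)
print(kept_levels, dummy_rows)
```

The structural difference is visible in the output shapes: WoE collapses any number of levels into one numeric column, so the model fits one coefficient, while dummy coding adds a column and a coefficient for every level beyond the reference. WoE is also computed from the target, so it should be fitted on training data only.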