Visualizing the relationships among multiple variables can get messy very quickly. This post looks at how the ggpairs() function in the GGally package handles this task, and at my own method for visualizing pairwise relationships when all the variables are categorical. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationships for a group of variables. Let's demonstrate this on a small segment of the vehicles dataset from the fueleconomy package, starting with how GGally::ggpairs() visualizes relationships between quantitative variables. The visualization changes a little when we have a mix of quantitative and categorical variables.
Use this well-hidden editor to quickly add a folder to the Windows 10 system path. The system path has been part of Microsoft operating systems since the earliest days of MS-DOS. This environment variable lives on in Windows 10 as a way to tell the system where to look when you try to run a command. Normally, the system looks in the Windows folder and its System32 subfolder. But you might want to add a folder to the path so that you can run custom utilities stored in that folder.
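To illustrate why adding a folder to the path makes a custom utility runnable, here is a minimal Python sketch of a path lookup. It is a simplified model, not how Windows itself resolves commands: real resolution also consults the current directory and the PATHEXT extension list, and compares folder names case-insensitively. The helper names `find_command` and `add_to_path` and the file name `mytool.cmd` are hypothetical, chosen only for this example.

```python
import os
import tempfile


def find_command(name, path_value, sep=os.pathsep):
    """Return the first file called `name` in the folders of path_value, else None.

    Simplified model: ignores PATHEXT and the current directory.
    """
    for folder in path_value.split(sep):
        if not folder:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def add_to_path(path_value, folder, sep=os.pathsep):
    """Return path_value with `folder` appended, unless it is already listed."""
    folders = [f for f in path_value.split(sep) if f]
    if folder not in folders:
        folders.append(folder)
    return sep.join(folders)


def demo():
    """Show that a tool is found only after its folder joins the path."""
    with tempfile.TemporaryDirectory() as system_dir, \
            tempfile.TemporaryDirectory() as tools_dir:
        tool = os.path.join(tools_dir, "mytool.cmd")  # hypothetical utility
        with open(tool, "w") as fh:
            fh.write("echo hello\n")

        path_value = system_dir  # stand-in for the default system folders
        before = find_command("mytool.cmd", path_value)

        path_value = add_to_path(path_value, tools_dir)
        after = find_command("mytool.cmd", path_value)
        return before, after == tool


if __name__ == "__main__":
    print(demo())
```

The sketch only edits a string inside the running process. A change made this way is not persistent, which is why a lasting change goes through the system's environment variable editor instead.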
I need inputs on the pros and cons of building a logistic regression model using dummy variables instead of the weight of evidence (WoE) approach for categorical variables. I know one of the things that needs to be looked at is the number of unique levels within a categorical variable. But, making reasonable assumptions and speaking generically, I would like to know whether there are any pros, and any other cons, of using the dummy variable approach vs. the WoE approach.
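To make the comparison concrete, here is a minimal sketch of the two encodings in plain Python. It rests on some assumptions: the target is binary with 1 marking the event, WoE is taken as the log of the level's share of events over its share of non-events (some references use the opposite sign), and a small additive smoothing term, chosen arbitrarily here as 0.5, keeps the logarithm finite for levels with no events or no non-events. The function names are hypothetical.

```python
import math
from collections import Counter


def dummy_encode(values, reference=None):
    """One 0/1 column per level except the reference level.

    Returns (kept_levels, rows); a variable with k levels yields k - 1 columns.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first level alphabetically
    kept = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows


def woe_table(values, targets, smoothing=0.5):
    """Map each level to one number: ln(share of events / share of non-events)."""
    events, non_events = Counter(), Counter()
    for v, y in zip(values, targets):
        if y == 1:
            events[v] += 1
        else:
            non_events[v] += 1

    levels = sorted(set(values))
    # Smoothed totals so the per-level shares still sum to one.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)

    table = {}
    for lv in levels:
        event_share = (events[lv] + smoothing) / total_events
        non_event_share = (non_events[lv] + smoothing) / total_non_events
        table[lv] = math.log(event_share / non_event_share)
    return table


if __name__ == "__main__":
    values = ["a", "a", "b", "b", "c", "c"]
    targets = [1, 0, 1, 1, 0, 0]
    print(dummy_encode(values))
    print(woe_table(values, targets))
```

The sketch shows the structural difference behind the question: dummy coding adds one coefficient per non-reference level, so the number of parameters grows with the number of levels, while WoE collapses the variable into a single numeric column whose values are computed from the target and therefore have to be estimated on training data only.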
This article considers the problem of multi-group classification in the setting where the number of variables $p$ is larger than the number of observations $n$. Several methods have been proposed in the literature to address this problem; however, their variable selection performance is either unknown or suboptimal relative to the results known in the two-group case. In this work we provide sharp conditions for the consistent recovery of relevant variables in the multi-group case using the discriminant analysis proposal of Gaynanova et al. (2014). We achieve rates of convergence that attain the optimal scaling of the sample size $n$, the number of variables $p$ and the sparsity level $s$. These rates are significantly faster than the best known results in the multi-group case. Moreover, they coincide with the optimal minimax rates for the two-group case. We validate our theoretical results with numerical analysis.
We present the domain-independent HRFF algorithm, which solves goal-oriented HMDPs by incrementally aggregating plans generated by the METRIC-FF planner into a policy defined over discrete and continuous state variables. HRFF takes into account non-monotonic state variables and complex combinations of many discrete and continuous probability distributions. We introduce new data structures and algorithmic paradigms to deal with continuous state spaces: hybrid hierarchical hash tables, domain determinization based on dynamic domain sampling or on static computation of the modes of probability distributions, and optimization settings under METRIC-FF based on plan probability and length. We analyze in depth the behavior of HRFF on a probabilistically interesting structured navigation problem with continuous dead-ends and non-monotonic continuous state variables. We compare with HAO* on the Rover domain and show that HRFF outperforms HAO* by many orders of magnitude in terms of computation time and memory usage.