I spent the last hour putting together an example that illustrates my question: https://imgur.com/a/36gU0pb My question relates indirectly to exploratory data analysis and feature selection for statistical models. Suppose you have some variables (let's assume they are categorical for this example), and when you plot a histogram of each one (a bar chart of category counts), the distribution looks extremely skewed. At first glance, you would not want to include heavily skewed variables as inputs to a statistical model: if 99% of the observations take a single value, how informative and useful could the variable be? But how do you know that these heavily skewed variables don't carry some very useful information in the remaining 1%, information that might really help you make future predictions?
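To make the concern concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made up for illustration: the 99/1 split, the outcome rates (5% for the common level, 60% for the rare one), and the helper names `simulate`, `outcome_rate_by_level`, and `mutual_information_bits`. It is not taken from the linked example. It simulates a skewed categorical variable and checks two things: the outcome rate within each level, and the empirical mutual information between the variable and the outcome.

```python
import math
import random
from collections import Counter


def simulate(n=100_000, rare_share=0.01, p_common=0.05, p_rare=0.60, seed=0):
    """Return a list of (x, y) pairs.

    x is a skewed categorical variable ("common" or "rare"),
    y is a binary outcome whose rate depends on x.
    All rates are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = "rare" if rng.random() < rare_share else "common"
        p = p_rare if x == "rare" else p_common
        y = 1 if rng.random() < p else 0
        rows.append((x, y))
    return rows


def outcome_rate_by_level(rows):
    """Share of y == 1 within each level of x."""
    totals, positives = Counter(), Counter()
    for x, y in rows:
        totals[x] += 1
        positives[x] += y
    return {level: positives[level] / totals[level] for level in totals}


def mutual_information_bits(rows):
    """Empirical mutual information I(X; Y) in bits."""
    n = len(rows)
    joint = Counter(rows)
    px = Counter(x for x, _ in rows)
    py = Counter(y for _, y in rows)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


rows = simulate()
rates = outcome_rate_by_level(rows)
mi = mutual_information_bits(rows)

print("outcome rate by level:", rates)
print("mutual information (bits):", round(mi, 4))
```

Under these assumed rates, the per-level outcome rates differ sharply even though the variable is almost constant. The mutual information will still look small, because it cannot exceed the entropy of the skewed variable itself (roughly 0.08 bits for a 99/1 split). So a low overall score does not by itself show that the rare level is uninformative for the rows where it occurs.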
Apr-30-2021, 06:06:09 GMT