Are categorical variables getting lost in your random forests?
Many real-world datasets include a mix of continuous and categorical variables. The defining property of the latter is that they do not permit a total ordering. A major advantage of decision tree models and their ensemble counterparts, random forests, is that they can operate on both continuous and categorical variables directly. In contrast, most other popular models (e.g., generalized linear models, neural networks) must transform categorical variables into some numerical analog, usually by one-hot encoding them to create a new dummy variable for each level of the original variable. One-hot encoding can greatly increase the dimensionality of the feature representation. For example, one-hot encoding the 50 U.S. states replaces a single feature with 50 dummy variables, adding 49 dimensions to the intuitive feature representation.
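As a rough illustration of the dimensionality increase described above, here is a minimal pure-Python sketch of one-hot encoding. The function name `one_hot_encode` and the sample `states` list are hypothetical and chosen only for this example; they do not come from the original article, and real pipelines would more likely use a library encoder.

```python
def one_hot_encode(values):
    """Return (levels, rows): one 0/1 dummy column per distinct level."""
    # Sort the distinct levels so the column order is deterministic.
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)   # one dummy variable per level
        row[index[value]] = 1     # mark the level this observation takes
        rows.append(row)
    return levels, rows


# Hypothetical sample: a single categorical "state" feature with 4 levels.
states = ["CA", "NY", "TX", "CA", "WA"]
levels, encoded = one_hot_encode(states)

# One original feature becomes len(levels) dummy features.
added_dimensions = len(levels) - 1

print(levels)            # ['CA', 'NY', 'TX', 'WA']
print(encoded[0])        # [1, 0, 0, 0]
print(added_dimensions)  # 3
```

With all 50 U.S. states present, the same calculation gives 50 dummy columns in place of one feature, which is the 49 added dimensions mentioned above.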
Sep-14-2020, 04:00:23 GMT