A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

Sigrist, Fabio

arXiv.org Artificial Intelligence 

High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set or, equivalently, for which there is little data per level. High-cardinality categorical variables can pose difficulties for machine learning methods such as deep neural networks and tree-based models. A simple strategy for dealing with categorical variables is to use one-hot encoding or dummy variables. However, this approach often does not work well for high-cardinality categorical variables due to the reasons described below. For neural networks, a frequently adopted solution is to use entity embeddings [Guo and Berkhahn, 2016] that map every level of a categorical variable into a low-dimensional Euclidean space. For tree-boosting, an alternative to one-hot encoding is to assign a number to every level of a categorical variable and then treat the result as a one-dimensional numeric variable. Another solution, implemented in the LightGBM boosting library [Ke et al., 2017], partitions all levels into two subsets using an approximate approach [Fisher, 1958] when finding splits in the tree-building algorithm. Further, the CatBoost boosting library [Prokhorenkova et al., 2018] handles categorical predictor variables with an approach based on ordered target statistics calculated using random partitions of the training data. Random effects [Laird et al., 1982, Pinheiro and Bates, 2006] can also be used as a tool for handling high-cardinality categorical variables.
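To make three of the encodings mentioned above concrete, the following is a minimal sketch in plain Python of one-hot encoding, integer encoding, and a simplified ordered target statistic. It is an illustration only and not the implementation used by LightGBM or CatBoost: the function names, the `prior` and `prior_weight` smoothing parameters, the use of a single random ordering of the rows, and the toy data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def one_hot_encode(levels):
    """Return the sorted distinct levels and one 0/1 indicator row per sample."""
    categories = sorted(set(levels))
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for level in levels:
        row = [0] * len(categories)
        row[index[level]] = 1
        rows.append(row)
    return categories, rows


def integer_encode(levels):
    """Assign one number to every level (here: its rank in sorted order)."""
    categories = sorted(set(levels))
    index = {c: j for j, c in enumerate(categories)}
    return [index[level] for level in levels]


def ordered_target_statistics(levels, targets, prior, prior_weight=1.0, seed=0):
    """Simplified ordered target statistic (hypothetical helper).

    Rows are visited in one random order. Each row is encoded with a
    smoothed mean of the targets of rows with the same level that were
    visited earlier, so a row's own target never enters its encoding.
    """
    rng = random.Random(seed)
    order = list(range(len(levels)))
    rng.shuffle(order)

    sums = defaultdict(float)   # running target sum per level
    counts = defaultdict(int)   # running row count per level
    encoded = [0.0] * len(levels)
    for i in order:
        level = levels[i]
        encoded[i] = (sums[level] + prior_weight * prior) / (counts[level] + prior_weight)
        sums[level] += targets[i]
        counts[level] += 1
    return encoded


# Toy data (assumed for illustration): six samples, three levels.
levels = ["a", "b", "a", "c", "b", "a"]
targets = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
prior = sum(targets) / len(targets)

categories, one_hot = one_hot_encode(levels)
codes = integer_encode(levels)
ts = ordered_target_statistics(levels, targets, prior)
```

One-hot encoding adds one column per level, so the number of columns grows with the cardinality, whereas the integer and target-statistic encodings each produce a single numeric column. In this sketch, a level that occurs only once (such as `"c"`) is always encoded with the prior, which illustrates how little information a level with little data contributes.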
