A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

Jul-5-2023–arXiv.org Artificial Intelligence

High-cardinality categorical variables are variables for which the number of distinct levels is large relative to the sample size of a data set or, equivalently, for which there is little data per level. They can pose difficulties for machine learning methods such as deep neural networks and tree-based models. A simple strategy for dealing with categorical variables is to use one-hot encoding (dummy variables), but this approach often does not work well for high-cardinality categorical variables, for the reasons described below. For neural networks, a frequently adopted solution is entity embeddings [Guo and Berkhahn, 2016], which map every level of a categorical variable into a low-dimensional Euclidean space. For tree-boosting, an alternative to one-hot encoding is to assign a number to every level of a categorical variable and then treat the result as a one-dimensional numeric variable. Another solution, implemented in the LightGBM boosting library [Ke et al., 2017], partitions all levels into two subsets using an approximate approach [Fisher, 1958] when finding splits in the tree-building algorithm. Further, the CatBoost boosting library [Prokhorenkova et al., 2018] handles categorical predictor variables with an approach based on ordered target statistics calculated using random partitions of the training data. Random effects [Laird et al., 1982, Pinheiro and Bates, 2006] can also be used as a tool for handling high-cardinality categorical variables.
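To make three of the encodings mentioned above concrete, the following is a minimal NumPy sketch, not code from the paper or from any boosting library. The function names (`one_hot`, `integer_encode`, `ordered_target_stat`) and the parameters `prior`, `prior_weight`, and `seed` are hypothetical and chosen for illustration only. The ordered target statistic is a simplified variant that assumes a single random permutation of the rows and a smoothed running mean; CatBoost's actual implementation differs in its details.

```python
import numpy as np


def one_hot(levels):
    """One indicator column per distinct level; width equals the cardinality."""
    uniq, codes = np.unique(levels, return_inverse=True)
    out = np.zeros((len(levels), len(uniq)))
    out[np.arange(len(levels)), codes] = 1.0
    return out


def integer_encode(levels):
    """Assign an integer to every level (here: rank in sorted order)."""
    _, codes = np.unique(levels, return_inverse=True)
    return codes


def ordered_target_stat(levels, y, prior, prior_weight=1.0, seed=0):
    """Simplified ordered target statistic (illustrative assumption).

    Rows are visited in a random order. Each row is encoded using only the
    targets of rows with the same level that were visited before it, shrunk
    towards `prior`, so a row's own target never enters its encoding.
    """
    rng = np.random.default_rng(seed)
    sums, counts = {}, {}
    enc = np.empty(len(levels))
    for i in rng.permutation(len(levels)):
        key = levels[i]
        s, c = sums.get(key, 0.0), counts.get(key, 0)
        enc[i] = (s + prior_weight * prior) / (c + prior_weight)
        sums[key] = s + y[i]
        counts[key] = c + 1
    return enc
```

With few observations per level, the one-hot matrix has nearly as many columns as rows, while the integer code imposes an arbitrary ordering on the levels. The ordered statistic yields a single numeric column whose values for rarely seen levels stay close to the prior.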
