Encoding categorical data: Is there yet anything 'hotter' than one-hot encoding?

Dec-28-2023–arXiv.org Artificial Intelligence

Categorical features are present in about 40% of real-world problems, highlighting the crucial role of encoding as a preprocessing component. Some recent studies have reported benefits of various target-based encoders over classical target-agnostic approaches. However, these claims are not supported by any statistical analysis, and are based on a single dataset or a very small and heterogeneous sample of datasets. The present study explores encoding effects in an exhaustive sample of classification problems from the OpenML repository. We fitted linear mixed-effects models to the experimental data, treating task ID as a random effect, and the encoding scheme and the various characteristics of categorical features as fixed effects. We found that in multiclass tasks, one-hot encoding and Helmert contrast coding outperform target-based encoders. In binary tasks, there were no significant differences across the encoding schemes; however, one-hot encoding demonstrated a marginally positive effect on the outcome. Importantly, we found no significant interactions between the encoding schemes and the characteristics of categorical features. This suggests that our findings are generalizable to a wide variety of problems across domains.
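As an illustration of the two target-agnostic schemes named in the abstract, the following is a minimal pure-Python sketch of one-hot encoding and Helmert contrast coding. It is not the authors' implementation: the function names (`one_hot`, `helmert`, `encode`) and the example colour column are hypothetical, and the Helmert variant shown (each level compared with the mean of the preceding levels) is one common convention, assumed here because the abstract does not specify which variant was used.

```python
def one_hot(levels):
    """Map each of k levels to a k-dimensional 0/1 indicator vector."""
    k = len(levels)
    return {lvl: [1 if i == j else 0 for j in range(k)]
            for i, lvl in enumerate(levels)}


def helmert(levels):
    """Map each of k levels to k-1 Helmert contrast values.

    Assumed convention: column j contrasts level j+1 with the
    mean of levels 0..j, so every column sums to zero.
    """
    k = len(levels)
    codes = {}
    for i, lvl in enumerate(levels):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1)       # level belongs to the "previous levels" group
            elif i == j + 1:
                row.append(j + 1)    # the level being contrasted
            else:
                row.append(0)        # level not involved in this contrast
        codes[lvl] = row
    return codes


def encode(column, codes):
    """Replace each categorical value in a column with its code vector."""
    return [codes[value] for value in column]


# Hypothetical categorical feature with three levels.
levels = ["red", "green", "blue"]
column = ["green", "blue", "red", "green"]

one_hot_rows = encode(column, one_hot(levels))
helmert_rows = encode(column, helmert(levels))
```

One-hot encoding yields one column per level, while Helmert coding yields one column fewer. Neither scheme uses the target variable, which is what distinguishes them from the target-based encoders discussed in the abstract.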

cardinality, dataset, encoder

Country:
- Europe > Russia (0.04)
- Asia > Russia
  - Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.04)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study > Negative Result (0.68)

Technology:
- Information Technology
  - Data Science (1.00)
  - Artificial Intelligence > Machine Learning
    - Statistical Learning (0.70)
    - Ensemble Learning (0.48)
    - Neural Networks (0.47)