Encoding categorical data: Is there yet anything 'hotter' than one-hot encoding?
Poslavskaya, Ekaterina, Korolev, Alexey
arXiv.org Artificial Intelligence
Categorical features are present in about 40% of real-world problems, which makes encoding a crucial preprocessing component. Some recent studies have reported benefits of various target-based encoders over classical target-agnostic approaches. However, these claims are not supported by statistical analysis and rest on a single dataset or on a very small, heterogeneous sample of datasets. The present study explores encoding effects in an exhaustive sample of classification problems from the OpenML repository. We fitted linear mixed-effects models to the experimental data, treating task ID as a random effect and the encoding scheme and various characteristics of the categorical features as fixed effects. We found that in multiclass tasks, one-hot encoding and Helmert contrast coding outperform target-based encoders. In binary tasks, there were no significant differences across encoding schemes; however, one-hot encoding demonstrated a marginally positive effect on the outcome. Importantly, we found no significant interactions between the encoding schemes and the characteristics of the categorical features. This suggests that our findings generalize to a wide variety of problems across domains.
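For readers unfamiliar with the encoders named in the abstract, the following is a minimal pure-Python sketch of the two target-agnostic schemes (one-hot and Helmert contrast coding) and of one simple target-based scheme (per-category target mean). It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Helmert matrix follows one common convention (each level contrasted with the preceding levels), and the target encoder is an unsmoothed mean with no out-of-fold protection, which the abstract does not specify.

```python
def one_hot(values):
    """One indicator column per level; levels are sorted for a stable order."""
    levels = sorted(set(values))
    return [[1 if v == lev else 0 for lev in levels] for v in values]


def helmert_matrix(k):
    """k x (k-1) Helmert contrast matrix (assumed convention).

    Column j has -1 for levels 0..j, the value j+1 for level j+1,
    and 0 for all later levels, so every column sums to zero.
    """
    rows = []
    for i in range(k):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1)
            elif i == j + 1:
                row.append(j + 1)
            else:
                row.append(0)
        rows.append(row)
    return rows


def helmert(values):
    """Map each value to its row in the Helmert contrast matrix."""
    levels = sorted(set(values))
    matrix = helmert_matrix(len(levels))
    index = {lev: i for i, lev in enumerate(levels)}
    return [matrix[index[v]] for v in values]


def target_mean(values, y):
    """Naive target-based encoding: replace each level by the mean of y.

    Assumption: no smoothing and no out-of-fold estimation, so this
    version leaks the target when applied to its own training data.
    """
    sums, counts = {}, {}
    for v, t in zip(values, y):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]


# Toy data (hypothetical): one categorical feature and a binary target.
colors = ["red", "green", "blue", "green"]
labels = [1, 0, 1, 1]

print(one_hot(colors))
print(helmert(colors))
print(target_mean(colors, labels))
```

One-hot and Helmert coding depend only on the feature values, whereas the target-based encoder needs the labels; this is the distinction between target-agnostic and target-based approaches that the abstract draws.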
Dec-28-2023
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study > Negative Result (0.68)
- Technology: