Polar Encoding: A Simple Baseline Approach for Classification with Missing Values
Lenz, Oliver Urs, Peralta, Daniel, Cornelis, Chris
arXiv.org Artificial Intelligence
We propose polar encoding, a representation of categorical and numerical $[0,1]$-valued attributes with missing values to be used in a classification context. We argue that this is a good baseline approach, because it can be used with any classification algorithm, preserves missingness information, is very simple to apply and offers good performance. In particular, unlike the existing missing-indicator approach, it does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values, thereby providing a practical realisation of the "missingness incorporated in attributes" (MIA) proposal. Furthermore, we show that categorical and $[0,1]$-valued attributes can be viewed as special cases of a single attribute type, corresponding to the classical concept of barycentric coordinates, and that this offers a natural interpretation of polar encoding as a fuzzified form of one-hot encoding. With an experiment based on twenty real-life datasets with missing values, we show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies multiple imputation by chained equations (MICE) and multiple imputation with denoising autoencoders (MIDAS) and, depending on the classifier, about as well as or better than mean/mode imputation with missing-indicators.
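The abstract does not spell out the encoding itself, so the following Python sketch shows one reading that is consistent with its description, not the authors' reference implementation. It assumes that a $[0,1]$-valued attribute $x$ becomes the pair $(1 - x, x)$, that a categorical attribute becomes its one-hot vector, and that a missing value becomes the all-zero vector in both cases. The function names `polar_encode_numeric` and `polar_encode_categorical` are hypothetical.

```python
import numpy as np


def polar_encode_numeric(x):
    """Encode [0,1]-valued data as two columns (1 - x, x).

    Assumption: missing values (NaN) map to (0, 0). Under this reading,
    every non-missing value lies at Manhattan distance 1 from a missing one.
    """
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    high = np.where(missing, 0.0, x)
    low = np.where(missing, 0.0, 1.0 - high)
    return np.column_stack([low, high])


def polar_encode_categorical(values, categories):
    """One-hot encode `values`; a missing value (None) stays all zeros."""
    index = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for i, value in enumerate(values):
        if value is not None:
            encoded[i, index[value]] = 1.0
    return encoded


numeric = polar_encode_numeric([0.0, 0.25, 1.0, np.nan])
categorical = polar_encode_categorical(["a", None, "b"], ["a", "b"])

# Manhattan distance from the missing row to each non-missing row.
distances = np.abs(numeric[:3] - numeric[3]).sum(axis=1)
```

Under these assumptions no imputation step is needed, and the two columns of a numerical attribute behave like the two columns of a binary one-hot encoding whose entries may take fractional values, which matches the "fuzzified one-hot" interpretation in the abstract.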
Dec-19-2023