Similarity encoding for learning with dirty categorical variables

Cerda, Patricio, Varoquaux, Gaël, Kégl, Balázs

Jun-4-2018–arXiv.org Machine Learning

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
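The idea of similarity encoding can be illustrated with a minimal sketch. It assumes that the 3-gram similarity is the Jaccard index over sets of character 3-grams, which is one common definition and is not spelled out in the abstract. The function names (`char_ngrams`, `ngram_similarity`, `similarity_encode`), the space padding, the lowercasing, and the example categories are hypothetical choices made for illustration, not the authors' implementation.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of s (lowercased, space-padded)."""
    padded = f" {s.lower()} "  # padding is an assumption of this sketch
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard index between the n-gram sets of two strings (assumed definition)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    if not union:  # both strings too short to yield any n-gram
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(union)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column: the second entry is a misspelling of the first.
prototypes = ["police officer", "police sergeant", "nurse"]
values = ["police officer", "polic officer", "nurse"]
encoded = similarity_encode(values, prototypes)

for value, row in zip(values, encoded):
    print(value, [round(x, 2) for x in row])
```

A value that exactly matches a prototype gets 1 in that prototype's column, so the scheme behaves like one-hot encoding on clean categories, while the misspelled entry still lands close to its intended category. Using only a subset of the observed categories as `prototypes` keeps the feature dimension bounded when cardinality is very high.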

artificial intelligence, category, machine learning

Country:
- North America > United States (1.00)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine > Health Care Providers & Services (0.93)
- Education (0.68)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Supervised Learning > Representation Of Examples (0.55)
  - Statistical Learning
    - Clustering (0.46)
    - Regression (0.46)
