Cerda, Patricio, Varoquaux, Gaël, Kégl, Balázs

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.

Cerda, Patricio, Varoquaux, Gaël

Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation.Here, we seek low-dimensional encoding of high-cardinality string categorical variables. Ideally, these should be: scalable to many categories; interpretable to end users; and facilitate statistical analysis. We introduce two encoding approaches for string categories: Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities. We show that min-hash turns set inclusions into inequality relations that are easier to learn. Both approaches are scalable and streamable. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. We recommend the following: if scalability is central, the min-hash encoder is the best option as it does not require any data fit; if interpretability is important, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as one-hot encoding on inferred categories with informative feature names. Both models enable autoML on the original string entries as they remove the need for feature engineering or data cleaning.

ML algorithms work only with numerical values. So there is a need to model a problem and its data completely in numbers. For example, to run a clustering algorithm on a road network, representing the network / graph as an adjacency matrix is one way to model it. Similarly, a tabular data with a mix of numerical and non-numerical / categorical data also needs to be transformed or encoded to a table of only numbers for a ML algorithm to work on. Columns of string values are quite common in tabular data and in this article, some ideas on how to encode them, especially ones with high cardinality and are of known lengths like IP addresses, mobile numbers etc. are discussed.

Humans and computers don't understand data in the same way, and an active area of research in AI is determining how AI "thinks" about data. For example, the recent Quanta article Where We See Shapes, AI Sees Textures discusses an inherent disconnect between how humans and computer vision AI interpret images. The article addresses the implicit assumption many people have that when AI works with an image, it interprets the contents of the image the same way people do- by identifying the shapes of the objects. However, because most AI interprets images at a pixel level, it is more intuitive for the AI to label images by texture (i.e., more pixels in an image represent an object's texture than an object's outline or border) than by shape. Another useful example of this is in language.

Feature hashing is a powerful technique for handling sparse, high-dimensional features in machine learning. It is fast, simple, memory-efficient, and well suited to online learning scenarios. While an approximation, it has surprisingly low accuracy tradeoffs in many machine learning problems. In this post, I will cover the basics of feature hashing and how to use it for flexible, scalable feature encoding and engineering. I'll also mention feature hashing in the context of Apache Spark's MLlib machine learning library.