Most tabular datasets contain categorical features. The simplest way to work with these is label encoding, which maps each category to an arbitrary integer. It is easy to apply, but the artificial ordering it imposes can hurt model accuracy. In this post, I would like to show better approaches that can be used "out of the box" thanks to the Category Encoders Python library. I'm going to start by describing the different strategies for encoding categorical variables.
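To make the contrast concrete, here is a minimal sketch that puts a plain ordinal (label) encoding next to a target encoding from the category_encoders package; the toy DataFrame, column name, and values are illustrative, not taken from the post.

```python
# Minimal sketch comparing two encoders from category_encoders
# (pip install category_encoders). Data below is a made-up toy example.
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})
y = pd.Series([1, 0, 1, 0, 1])

# Ordinal (label) encoding: simple, but imposes an arbitrary order on categories.
ordinal = ce.OrdinalEncoder(cols=["color"]).fit_transform(X)

# Target encoding: replaces each category with a smoothed mean of the target,
# so the encoded column carries signal about y instead of an arbitrary rank.
target = ce.TargetEncoder(cols=["color"]).fit_transform(X, y)

print(ordinal)
print(target)
```

Because target encoding uses the labels, in practice it should be fit on training folds only to avoid leaking target information into validation data.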
Magdon-Ismail, Malik, Boutsidis, Christos
Principal components analysis (PCA) is the optimal linear auto-encoder of data. Sparse linear auto-encoders (e.g., sparse PCA) produce more interpretable features that can promote better generalization. We provide the first polynomial-time algorithms to construct optimal sparse linear auto-encoders, and we demonstrate the performance of our algorithms on real data.
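The paper's polynomial-time constructions are not part of standard libraries, but scikit-learn's SparsePCA is a readily available sparse linear encoder that illustrates the dense-versus-sparse trade-off the abstract describes; this is a sketch of the general idea, not the authors' algorithm.

```python
# Illustrative only: SparsePCA is a related sparse linear encoder,
# not the optimal polynomial-time construction from the paper.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))

# Dense PCA components: every feature contributes to every component.
pca = PCA(n_components=3).fit(X)

# Sparse components: the l1 penalty (alpha) zeroes out many loadings,
# yielding more interpretable features.
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

print(np.count_nonzero(pca.components_))   # fully dense: 3 x 10 nonzeros
print(np.count_nonzero(spca.components_))  # far fewer nonzero loadings
```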
Rubenstein, Paul K., Schoelkopf, Bernhard, Tolstikhin, Ilya
We study the role of latent space dimensionality in Wasserstein auto-encoders (WAEs). Through experimentation on synthetic and real datasets, we argue that random encoders should be preferred over deterministic encoders. We highlight the potential of WAEs for representation learning with promising results on a benchmark disentanglement task.
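As a conceptual illustration of the deterministic-versus-random distinction (not the WAE training objective), here is a minimal numpy sketch; the weight matrices are random stand-ins for learned encoder parameters.

```python
# Conceptual sketch: a deterministic encoder maps each input to a single
# point in latent space; a random (Gaussian) encoder maps it to a
# distribution and samples from it. Weights are stand-ins, not learned.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_z = 8, 2
W_mu = rng.standard_normal((d_z, d_in))
W_logvar = rng.standard_normal((d_z, d_in))

def deterministic_encode(x):
    return W_mu @ x  # one fixed code per input

def random_encode(x):
    mu, logvar = W_mu @ x, W_logvar @ x
    eps = rng.standard_normal(d_z)
    return mu + np.exp(0.5 * logvar) * eps  # a fresh sample per call

x = rng.standard_normal(d_in)
print(deterministic_encode(x))
print(random_encode(x), random_encode(x))  # two calls, two different codes
```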
Baig, Mohammad Haris, Koltun, Vladlen, Torresani, Lorenzo
We study the design of deep architectures for lossy image compression. We present two architectural recipes in the context of multi-stage progressive encoders and empirically demonstrate their importance on compression performance. Specifically, we show that: 1) predicting the original image data from residuals in a multi-stage progressive architecture facilitates learning and leads to improved performance at approximating the original content and 2) learning to inpaint (from neighboring image pixels) before performing compression reduces the amount of information that must be stored to achieve a high-quality approximation. Incorporating these design choices in a baseline progressive encoder yields an average reduction of over 60% in file size with similar quality compared to the original residual encoder.
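To make the progressive structure concrete, here is a control-flow sketch; the per-stage codec below is a crude quantizer standing in for the learned deep encoder/decoder stages in the paper, so only the residual-refinement loop is faithful to the architecture being described.

```python
# Control-flow sketch of a multi-stage progressive residual coder.
# stage_codec is a hypothetical stand-in for a learned lossy stage.
import numpy as np

def stage_codec(residual, step):
    # Crude lossy "codec": quantize the residual to a grid of size `step`.
    return np.round(residual / step) * step

x = np.random.default_rng(0).standard_normal((4, 4))  # toy "image"
reconstruction = np.zeros_like(x)

for stage in range(4):
    residual = x - reconstruction                 # what is still missing
    reconstruction += stage_codec(residual, step=1.0 / 2 ** stage)
    print(stage, np.abs(x - reconstruction).max())  # error shrinks per stage
```

Each stage only ever sees the residual left by the previous stages; the paper's first recipe amounts to additionally conditioning each stage on the original image content rather than on residuals alone.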
The Internet is full of images, and we all want them to load as fast as possible and look as good as possible. Companies that store and serve these images want to keep them as small as possible. Google's research team created a new JPEG encoder that should keep everyone happy: it serves up images that look great, yet their file size is about 35 percent smaller. The new open source algorithm is called Guetzli, which is Swiss German for "cookie."