Benchmarking Categorical Encoders


Most tabular datasets contain categorical features. The simplest way to work with them is to encode them with Label Encoder. This is simple, but it is not always accurate. In this post, I show better approaches that can be used "out of the box" thanks to the Category Encoders Python library. I start by describing different strategies for encoding categorical variables.
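As a rough illustration of why label encoding can mislead a model, the sketch below compares it with one-hot encoding on a toy column. The helper names `label_encode` and `one_hot_encode` are hypothetical and written by hand for this example; they are not the Category Encoders API.

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector with one slot per category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories


# Toy data, made up for this example.
colors = ["red", "green", "blue", "green"]

labels, mapping = label_encode(colors)
onehot, categories = one_hot_encode(colors)

print(labels)      # integer codes; they imply an order blue < green < red
print(categories)  # column order of the one-hot vectors
print(onehot)      # one indicator column per category, no implied order
```

The integer codes impose an ordering that the categories do not have, which a linear model may treat as meaningful. One-hot encoding avoids that at the cost of one column per category.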

Optimal Sparse Linear Encoders and Sparse PCA

Neural Information Processing Systems

Principal components analysis (PCA) is the optimal linear encoder of data. Sparse linear encoders (e.g., sparse PCA) produce more interpretable features that can promote better generalization. We answer both questions by providing the first polynomial-time algorithms to construct optimal sparse linear auto-encoders; additionally, we demonstrate the performance of our algorithms on real data. Papers published at the Neural Information Processing Systems Conference.
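To make the idea of PCA as a linear encoder concrete, here is a minimal sketch in plain Python. It finds the leading principal direction by power iteration, encodes each point as a single number, and decodes it back. This is ordinary dense PCA on made-up data, not the sparse algorithms the abstract refers to, and all function names are hypothetical.

```python
import math


def top_principal_direction(rows, iters=200):
    """Return column means and the leading eigenvector of the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting guess for power iteration
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def encode(row, means, v):
    """Linear encoder: project the centered point onto the direction v."""
    return sum((row[j] - means[j]) * v[j] for j in range(len(row)))


def decode(code, means, v):
    """Linear decoder: map the scalar code back to the original space."""
    return [means[j] + code * v[j] for j in range(len(v))]


# Toy 2-D data lying close to a line (assumed for illustration).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]

means, v = top_principal_direction(data)
codes = [encode(r, means, v) for r in data]
recon = [decode(c, means, v) for c in codes]
max_error = max(abs(a - b) for r, q in zip(data, recon) for a, b in zip(r, q))
print(v, max_error)
```

A sparse encoder would additionally limit how many entries of `v` may be nonzero, which is what makes the resulting feature easier to interpret.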

On the Latent Space of Wasserstein Auto-Encoders

Machine Learning

We study the role of latent space dimensionality in Wasserstein auto-encoders (WAEs). Through experimentation on synthetic and real datasets, we argue that random encoders should be preferred over deterministic encoders. We highlight the potential of WAEs for representation learning with promising results on a benchmark disentanglement task.

Learning to Inpaint for Image Compression

Neural Information Processing Systems

We study the design of deep architectures for lossy image compression. We present two architectural recipes in the context of multi-stage progressive encoders and empirically demonstrate their importance on compression performance. Specifically, we show that: 1) predicting the original image data from residuals in a multi-stage progressive architecture facilitates learning and leads to improved performance at approximating the original content and 2) learning to inpaint (from neighboring image pixels) before performing compression reduces the amount of information that must be stored to achieve a high-quality approximation. Incorporating these design choices in a baseline progressive encoder yields an average reduction of over 60% in file size with similar quality compared to the original residual encoder.
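The baseline idea of a multi-stage progressive (residual) encoder can be sketched with a toy scalar quantizer: each stage codes whatever the earlier stages left unexplained, and decoding more stages gives a closer approximation. This is only an assumed illustration of residual coding in general. It is not the deep architecture from the paper, and the function names and step sizes are hypothetical.

```python
def quantize(values, step):
    """Round each value to the nearest multiple of `step` (a lossy code)."""
    return [step * round(v / step) for v in values]


def progressive_encode(signal, steps):
    """Encode in stages; each stage codes the residual left by earlier stages."""
    residual = list(signal)
    stages = []
    for step in steps:
        coded = quantize(residual, step)
        stages.append(coded)
        residual = [r - c for r, c in zip(residual, coded)]
    return stages


def progressive_decode(stages, upto):
    """Sum the first `upto` stages to obtain a reconstruction."""
    out = [0.0] * len(stages[0])
    for stage in stages[:upto]:
        out = [o + s for o, s in zip(out, stage)]
    return out


# Toy "pixel" values and step sizes, made up for this example.
signal = [0.3, 1.7, 2.2, 3.9]
stages = progressive_encode(signal, steps=[2.0, 0.5, 0.125])

# Worst-case reconstruction error after decoding 1, 2, and 3 stages.
errors = [
    max(abs(a - b) for a, b in zip(signal, progressive_decode(stages, k)))
    for k in range(1, 4)
]
print(errors)
```

With a rounding quantizer the worst-case error after each stage is at most half that stage's step size, so a decoder can stop early for a coarse result or read every stage for a finer one.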

Google makes JPEG change

FOX News

The Internet is full of images, and we all want them to load as fast as possible and look as good as possible. Companies that store and serve these images want to keep them as small as possible. Google's research team has created a new JPEG encoder intended to satisfy both sides. It produces images that look great, with file sizes 35 percent smaller. The new open-source algorithm is called Guetzli, which is Swiss German for "cookie."