categorical feature

Why Machine Learning is more Practical than Econometrics in the Real World


I've read several studies and articles that claim econometric models are still superior to machine learning when it comes to forecasting. In the article "Statistical and Machine Learning forecasting methods: Concerns and ways forward", the authors write: "After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined." In many business environments, a data scientist is responsible for generating hundreds or thousands (possibly more) of forecasts for an entire company, as opposed to a single-series forecast. While it appears that econometric methods are better at forecasting a single series (which I generally agree with), how do they compare at forecasting multiple series, which is likely a more common requirement in the real world? In this article, I am going to show you an experiment I ran that compares machine learning models and econometric models for time series forecasting on an entire company's set of stores and departments.

Dealing with categorical features in machine learning


Categorical data are commonplace in many Data Science and Machine Learning problems but are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms. One of the most common ways to make this transformation is to one-hot encode the categorical features, especially when there does not exist a natural ordering between the categories (e.g. a feature 'City' with names of cities such as 'London', 'Lisbon', 'Berlin', etc.). Even though this type of encoding is used very frequently, it can be frustrating to try to implement it using scikit-learn in Python, as there isn't currently a simple transformer to apply, especially if you want to use it as a step of your machine learning pipeline. In this post, I'm going to describe how you can still implement it using only scikit-learn and pandas (but with a bit of effort).
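To make the idea of one-hot encoding concrete, here is a minimal library-free sketch. It is not the scikit-learn and pandas approach the post goes on to describe; the function name one_hot_encode and the sample list of cities are assumptions made purely for illustration. Each distinct category becomes its own 0/1 column, so no ordering between the categories is implied.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category.

    Hypothetical helper for illustration only; it is not part of
    scikit-learn or pandas.
    """
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # exactly one "hot" column per row
        rows.append(row)
    return categories, rows


# Assumed sample data, echoing the 'City' example above.
cities = ["London", "Lisbon", "Berlin", "Lisbon"]
categories, encoded = one_hot_encode(cities)

print(categories)
for city, row in zip(cities, encoded):
    print(city, row)
```

A real pipeline would also need to decide what to do with categories that appear only at prediction time, which this sketch does not handle.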

Robust Variational Autoencoders for Outlier Detection in Mixed-Type Data Machine Learning

We focus on the problem of unsupervised cell outlier detection in mixed-type tabular datasets. Traditional methods for outlier detection are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells in the dataset corrupt a specific row is an important problem in practice, especially in high-dimensional tables. We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells in the dataset. RVAE learns the probability of each cell in the dataset being an outlier, balancing the contributions of the different likelihood models in the row outlier score, making the method suitable for outlier detection in mixed-type datasets. We show experimentally that the RVAE performs better than several state-of-the-art methods in cell outlier detection for tabular datasets, while providing comparable or better results for row outlier detection.

Heart of Darkness: Logistic Regression vs. Random Forest


The 'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using, it's probably going to learn the other two balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an 'age' variable for the waterpoints, as that seems highly relevant. The 'population' variable also has a highly right-skewed distribution, so we're going to change that as well. One of the most important points we learned the week before, and something that will stay with me, is the idea of coming up with a baseline model as fast as one can.
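The two feature changes mentioned above (an 'age' for each waterpoint and a less skewed 'population') might look roughly like the sketch below. The excerpt does not show the author's code, so everything here is an assumption: the field names construction_year and population, the reference year, the choice to treat a construction year of 0 as unknown, and the use of log1p as the transformation for the skewed variable.

```python
import math


def add_age_and_log_population(row, reference_year):
    """Return a copy of row with 'age' and 'log_population' added.

    Hypothetical helper; the field names and the log1p transform are
    assumptions, not taken from the original post.
    """
    prepared = dict(row)

    built = row.get("construction_year")
    # Treat a missing or zero construction year as unknown, not as year 0.
    prepared["age"] = reference_year - built if built else None

    # log1p(x) = log(1 + x) compresses a long right tail and is defined at 0.
    prepared["log_population"] = math.log1p(row["population"])
    return prepared


# Assumed sample rows for illustration.
waterpoints = [
    {"construction_year": 1999, "population": 0},
    {"construction_year": 0, "population": 250},
]
prepared = [add_age_and_log_population(r, reference_year=2013) for r in waterpoints]

for row in prepared:
    print(row)
```

Keeping the unknown ages as None rather than a large number avoids creating an artificial cluster of very old waterpoints; how to impute them is a separate modelling decision.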

DLRM: An advanced, open source deep learning recommendation model


With the advent of deep learning, neural network-based personalization and recommendation models have emerged as an important tool for building recommendation systems in production environments, including here at Facebook. However, these models differ significantly from other deep learning models because they must be able to work with categorical data, which is used to describe higher-level attributes. It can be challenging for a neural network to work efficiently with this kind of sparse data, and the lack of publicly available details of representative models and data sets has slowed the research community's progress. To help advance understanding in this subfield, we are open-sourcing a state-of-the-art deep learning recommendation model (DLRM) that was implemented using Facebook's open source PyTorch and Caffe2 platforms. DLRM advances on other models by combining principles from both collaborative filtering and predictive analytics-based approaches, which enables it to work efficiently with production-scale data and provide state-of-the-art results.

An Enhanced Ad Event-Prediction Method Based on Feature Engineering Machine Learning

In digital advertising, Click-Through Rate (CTR) and Conversion Rate (CVR) are very important metrics for evaluating ad performance. As a result, ad event prediction systems are vital and widely used for sponsored search and display advertising as well as Real-Time Bidding (RTB). In this work, we introduce an enhanced method for ad event prediction (i.e. clicks, conversions) by proposing a new efficient feature engineering approach. A large real-world event-based dataset of a running marketing campaign is used to evaluate the efficiency of the proposed prediction algorithm. The results illustrate the benefits of the proposed ad event prediction approach, which significantly outperforms the alternative ones.

The Hitchhiker's Guide to Feature Extraction


Good features are the backbone of any machine learning model. And good feature creation often needs domain knowledge, creativity, and lots of time. TL;DR: this post is about useful feature engineering methods and tricks that I have learned and end up using often. Have you read about featuretools yet? If not, then you are going to be delighted.

Model Agnostic Contrastive Explanations for Structured Data Machine Learning

Recently, a method [7] was proposed to generate contrastive explanations for differentiable models such as deep neural networks, where one has complete access to the model. In this work, we propose a method, Model Agnostic Contrastive Explanations Method (MACEM), to generate contrastive explanations for any classification model where one is able to only query the class probabilities for a desired input. This allows us to generate contrastive explanations for not only neural networks, but models such as random forests, boosted trees and even arbitrary ensembles that are still amongst the state-of-the-art when learning on structured data [13]. Moreover, to obtain meaningful explanations we propose a principled approach to handle real and categorical features leading to novel formulations for computing pertinent positives and negatives that form the essence of a contrastive explanation. A detailed treatment of the different data types of this nature was not performed in the previous work, which assumed all features to be positive real valued with zero being indicative of the least interesting value. We part with this strong implicit assumption and generalize these methods so as to be applicable across a much wider range of problem settings. We quantitatively and qualitatively validate our approach over 5 public datasets covering diverse domains.



Eps 0.2.0 brings a number of improvements, including support for classification. Everyone is encouraged to help improve this project.

Predicting real-time availability of 200 million grocery items in North American stores


Ever wished there was a way to know if your favorite Ben and Jerry's ice cream flavor is currently available in a grocery store near you? Instacart's machine learning team has built tools to figure that out! Our marketplace's scale lets us build sophisticated prediction models. Our community of over 70,000 personal shoppers scans millions of items per day across 15,000 physical stores and delivers them to the customers. These stores belong to our grocery retail partners like Aldi, Costco, Kroger, Safeway, and Wegmans.