This is the fourth in a four-part series on how we approach machine learning at Feature Labs. These articles cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. All of the work documented here was completed with open-source tools and data.

The Machine Learning Modeling Process

The outputs of prediction and feature engineering are a set of label times (historical examples of what we want to predict) and features (predictor variables used to train a model to predict the label).
We refined this framework through experiments both at data science competitions and at ManoMano (a European DIY & Gardening marketplace with 1M daily users). This post introduces two core concepts at the same time: feature engineering (the process of transforming raw data into meaningful features to feed the desired algorithm) and the machine learning pipeline (the sequential data transformation workflow from data collection to prediction). Covering both at once might seem to complicate things. However, it is key to understand one while keeping the other in mind, because they are heavily linked. They must be applied in coordination for your project to succeed.
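To make the link between the two concepts concrete, here is a minimal sketch of a pipeline in which feature engineering is one step among several, from data collection to prediction. It is an illustration under stated assumptions, not the framework used at ManoMano: the toy order records, the function names, and the rule-based `predict` step (which stands in for a trained model) are all hypothetical.

```python
from datetime import date

# Hypothetical raw data: one record per order.
RAW_ORDERS = [
    {"customer": "a", "amount": 20.0, "day": date(2024, 1, 3)},
    {"customer": "a", "amount": 35.0, "day": date(2024, 1, 20)},
    {"customer": "b", "amount": 5.0, "day": date(2023, 11, 2)},
]


def collect(orders):
    """Data collection step: group raw orders by customer."""
    grouped = {}
    for order in orders:
        grouped.setdefault(order["customer"], []).append(order)
    return grouped


def engineer_features(grouped, today=date(2024, 2, 1)):
    """Feature engineering step: turn raw orders into per-customer features."""
    features = {}
    for customer, orders in grouped.items():
        last_order = max(o["day"] for o in orders)
        features[customer] = {
            "n_orders": len(orders),
            "total_spent": sum(o["amount"] for o in orders),
            "days_since_last": (today - last_order).days,
        }
    return features


def predict(features, max_idle_days=60):
    """Prediction step: a placeholder rule standing in for a trained model."""
    return {c: f["days_since_last"] > max_idle_days for c, f in features.items()}


def run_pipeline(data, steps):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        data = step(data)
    return data


predictions = run_pipeline(RAW_ORDERS, [collect, engineer_features, predict])
print(predictions)
```

Because each step only consumes the output of the previous one, changing the features means changing a single function while the rest of the workflow stays untouched, which is why the two concepts are easier to reason about together.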
This paper describes the autofeat Python library, which provides a scikit-learn style linear regression model with automatic feature engineering and selection capabilities. Complex non-linear machine learning models such as neural networks are in practice often difficult to train and even harder to explain to non-statisticians, who require transparent analysis results as a basis for important business decisions. While linear models are efficient and intuitive, they generally provide lower prediction accuracies. Our library provides a multi-step feature engineering and selection process, where first a large pool of non-linear features is generated, from which then a small and robust set of meaningful features is selected, which improve the prediction accuracy of a linear model while retaining its interpretability.
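The generate-then-select idea can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in and not the autofeat API or its actual selection procedure: the transform pool, the correlation-based selection, and every function name here are assumptions made for illustration.

```python
import math

# Hypothetical pool of non-linear transforms applied to a single input x > 0.
TRANSFORMS = {
    "x": lambda x: x,
    "x**2": lambda x: x * x,
    "x**3": lambda x: x ** 3,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "1/x": lambda x: 1.0 / x,
}


def generate_features(xs):
    """Step 1: build a pool of candidate non-linear features."""
    return {name: [f(x) for x in xs] for name, f in TRANSFORMS.items()}


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def select_best(pool, y):
    """Step 2: keep the candidate most correlated with the target."""
    return max(pool, key=lambda name: abs(pearson(pool[name], y)))


def fit_line(u, y):
    """Step 3: ordinary least squares for y = slope * u + intercept."""
    n = len(u)
    mu, my = sum(u) / n, sum(y) / n
    slope = sum((a - mu) * (b - my) for a, b in zip(u, y)) / sum(
        (a - mu) ** 2 for a in u
    )
    return slope, my - slope * mu


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0 * x * x + 2.0 for x in xs]  # toy target, quadratic in x

pool = generate_features(xs)
best = select_best(pool, y)
slope, intercept = fit_line(pool[best], y)
print(f"y = {slope:.2f} * {best} + {intercept:.2f}")
```

The fitted model stays linear in the selected feature, so its coefficients remain readable even though the relationship to the raw input is non-linear.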
I was browsing Twitter yesterday (follow me!) when I came across a tweet by Data Science Renee linking to a Medium article called "Top 6 Errors Novice Machine Learning Engineers Make" by Christopher Dossman. It caught my attention because I'm somewhat new to the field (and even if I weren't, it's always worth reviewing the fundamentals).