Good features are the backbone of any machine learning model, and creating them often takes domain knowledge, creativity, and a lot of time. TL;DR: this post covers useful feature engineering methods and tricks that I have learned and end up using often, along with some other ideas for thinking about feature creation. Have you read about Featuretools yet? If not, you are going to be delighted.
Anyone who has participated in machine learning hackathons and competitions can attest to how crucial feature engineering can be. It is often the difference between getting into the top 10 of the leaderboard and finishing outside the top 50! I have been a huge advocate of feature engineering ever since I realized its immense potential. But it can be a slow and arduous process when done manually. I have to spend time brainstorming which features to come up with, and then analyze their usefulness from different angles.
A normalized, relational dataset makes it easier to perform feature engineering. Unfortunately, raw data for machine learning is often stored as a single table, which makes the normalization process tedious and time-consuming. Well, I am happy to introduce you to AutoNormalize, an open-source library that automates the normalization process and integrates seamlessly with Featuretools, another open-source library for automated feature engineering. The normalized dataset can then be returned as either an EntitySet or a collection of DataFrames. Using AutoNormalize makes it easier to get started with Featuretools and can help provide you with a quick preview of what Featuretools is capable of.
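To make the idea of normalization concrete, here is a minimal hand-rolled sketch in plain Python. It is not AutoNormalize's API or its algorithm; the helper names `depends_on` and `split_on_key`, and the toy `flat` table, are hypothetical and only illustrate the idea of splitting a single table into a parent and a child table when some columns are fully determined by a key column.

```python
def depends_on(rows, key, col):
    """Return True if each value of `key` maps to exactly one value of `col`."""
    seen = {}
    for row in rows:
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(row[key], row[col]) != row[col]:
            return False
    return True


def split_on_key(rows, key, index):
    """Split a flat table into a parent table (one row per `key`) and a child table."""
    candidates = [c for c in rows[0] if c not in (key, index)]
    # Columns fully determined by `key` move to the parent table
    dependent = [c for c in candidates if depends_on(rows, key, c)]

    parent = {}
    for row in rows:
        parent[row[key]] = {key: row[key], **{c: row[c] for c in dependent}}

    # The child table keeps `key` as a foreign key and drops the moved columns
    child = [{c: v for c, v in row.items() if c not in dependent} for row in rows]
    return list(parent.values()), child


# Hypothetical single-table data: one row per order, customer details repeated
flat = [
    {"order_id": 1, "customer_id": "a", "customer_zip": "02139", "amount": 20.0},
    {"order_id": 2, "customer_id": "a", "customer_zip": "02139", "amount": 35.5},
    {"order_id": 3, "customer_id": "b", "customer_zip": "60601", "amount": 12.0},
]

customers, orders = split_on_key(flat, key="customer_id", index="order_id")
print(customers)
print(orders)
```

Here `customer_zip` ends up in the `customers` table because it never varies within a customer, while `amount` stays with the orders. On a small sample such a dependency can hold by coincidence, so a real tool has to treat detected dependencies with more care than this sketch does.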
We often fall short of creativity, and creativity is one of the basic ingredients of what we do. So here is a list of ideas I have gathered in day-to-day work, where people have used creativity to get great results on Kaggle leaderboards. This post is inspired by a Kernel on Kaggle written by Beluga, one of the top Kagglers, for a knowledge-based competition. Some of the techniques and tricks I am sharing have been taken directly from that kernel, so you can take a look yourself.
ML 2.0: In this paper, we propose a paradigm shift from the current practice of creating machine learning models - which requires months-long discovery, exploration and "feasibility report" generation, followed by re-engineering for deployment - in favor of a rapid, 8-week process of development, understanding, validation and deployment that can be executed by developers or subject matter experts (non-ML experts) using reusable APIs. This accomplishes what we call a "minimum viable data-driven model," delivering a ready-to-use machine learning model for problems that haven't been solved before using machine learning. We provide provisions for the refinement and adaptation of the "model," with strict enforcement and adherence to both the scaffolding/abstractions and the process. We imagine that this will bring forth the second phase in machine learning, in which discovery is subsumed by more targeted goals of delivery and impact.