Good Features are the backbone of any machine learning model. And good feature creation often needs domain knowledge, creativity, and lots of time. And some other ideas to think about feature creation. TLDR; this post is about useful feature engineering methods and tricks that I have learned and end up using often. Have you read about featuretools yet? If not, then you are going to be delighted.
Anyone who has participated in machine learning hackathons and competitions can attest to how crucial feature engineering can be. It is often the difference between getting into the top 10 of the leaderboard and finishing outside the top 50! I have been a huge advocate of feature engineering ever since I realized it's immense potential. But it can be a slow and arduous process when done manually. I have to spend time brainstorming over what features to come up, and analyze their usability them from different angles.
Often times it happens that we fall short of creativity. And creativity is one of the basic ingredients of what we do. So here is the list of ideas I gather in day to day life, where people have used creativity to get great results on Kaggle leaderboards. This post is inspired by a Kernel on Kaggle written by Beluga, one of the top Kagglers, for a knowledge based competition. Some of the techniques/tricks I am sharing have been taken directly from that kernel so you could take a look yourself.
ML 2.0: In this paper, we propose a paradigm shift from the current practice of creating machine learning models - which requires months-long discovery, exploration and "feasibility report" generation, followed by re-engineering for deployment - in favor of a rapid, 8-week process of development, understanding, validation and deployment that can executed by developers or subject matter experts (non-ML experts) using reusable APIs. This accomplishes what we call a "minimum viable data-driven model," delivering a ready-to-use machine learning model for problems that haven't been solved before using machine learning. We provide provisions for the refinement and adaptation of the "model," with strict enforcement and adherence to both the scaffolding/abstractions and the process. We imagine that this will bring forth the second phase in machine learning, in which discovery is subsumed by more targeted goals of delivery and impact.
Applied Data Scientists throughout various industries are commonly faced with the challenging task of encoding high-cardinality categorical features into digestible inputs for machine learning algorithms. This paper describes a Bayesian encoding technique developed for WeWork's lead scoring engine which outputs the probability of a person touring one of our office spaces based on interaction, enrichment, and geospatial data. We present a paradigm for ensemble modeling which mitigates the need to build complicated preprocessing and encoding schemes for categorical variables. In particular, domain-specific conjugate Bayesian models are employed as base learners for features in a stacked ensemble model. For each column of a categorical feature matrix we fit a problem-specific prior distribution, for example, the Beta distribution for a binary classification problem. In order to analytically derive the moments of the posterior distribution, we update the prior with the conjugate likelihood of the corresponding target variable for each unique value of the given categorical feature. This function of column and value encodes the categorical feature matrix so that the final learner in the ensemble model ingests low-dimensional numerical input. Experimental results on both curated and real world datasets demonstrate impressive accuracy and computational efficiency on a variety of problem archetypes. Particularly, for the lead scoring engine at WeWork -- where some categorical features have as many as 300,000 levels -- we have seen an AUC improvement from 0.87 to 0.97 through implementing conjugate Bayesian model encoding.