An average data scientist deals with lots of data daily, and around 60–70% of that time is spent on data cleaning, data munging, and converting the data into a suitable form so that a machine learning model can be applied to it. This blog focuses on applying machine learning models, including the preprocessing steps. Many data science enthusiasts ask me how to solve a machine learning problem. Before applying machine learning models, the data must be converted to a tabular form. There are two types of variables in such data: numerical and categorical.
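To make the numerical versus categorical distinction concrete, here is a minimal sketch of one common way to turn a categorical column into numeric columns (one-hot encoding). The column names `age` and `plan`, the sample rows, and the helper `one_hot_encode` are made up for illustration; in practice a library encoder would usually do this job.

```python
def one_hot_encode(rows, categorical_cols):
    """Replace each categorical column with one 0/1 column per category level."""
    # Sort the levels so the output column order is deterministic.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical_cols}
    encoded = []
    for r in rows:
        # Numerical columns are copied through unchanged.
        out = {k: v for k, v in r.items() if k not in categorical_cols}
        for c in categorical_cols:
            for level in levels[c]:
                out[f"{c}={level}"] = 1 if r[c] == level else 0
        encoded.append(out)
    return encoded


# Hypothetical rows: "age" is numerical, "plan" is categorical.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 27, "plan": "basic"},
]
encoded = one_hot_encode(rows, ["plan"])
```

After this step every column is numeric, which is the tabular form most models expect.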
Nowadays everyone knows what customer churn is, how to predict it, and how to handle it. Even junior data scientists have worked on at least part of it. Seniors have already built prediction models to tell whether a customer is going to leave or not. But is it really that simple? The answer is no, of course.
I recently participated in a Kaggle competition (the WiDS Datathon by Stanford), where I was able to finish in the top 10 using various boosting algorithms. Since then, I have been very curious about the fine workings of each model, including parameter tuning and pros and cons, and hence decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful when training data is limited, training time is short, and there is little expertise available for parameter tuning. Since XGBoost (often called the GBM killer) has been around in the machine learning world for longer, with lots of articles dedicated to it, this post will focus more on CatBoost and LightGBM (LGBM). LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances used for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split.
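To give a feel for what GOSS does, here is a rough sketch of the sampling step as I understand it: keep the instances with the largest gradients, randomly sample from the rest, and up-weight the sampled ones so the overall gradient statistics stay roughly unbiased. This is not LightGBM's actual code; the function name `goss_sample` and the parameter names `top_rate` and `other_rate` are chosen for this illustration.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for the instances GOSS would keep."""
    n = len(gradients)
    top_n = int(top_rate * n)     # instances kept for having large gradients
    rand_n = int(other_rate * n)  # instances sampled from the remainder

    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx, rest = order[:top_n], order[top_n:]

    # Randomly sample from the small-gradient instances.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(rand_n, len(rest)))

    # Sampled small-gradient instances are amplified by (1 - a) / b.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled})
    return weights


# Toy gradients: instance i has gradient 0.1 * i.
gradients = [0.1 * i for i in range(10)]
weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

With ten instances, this keeps the two largest-gradient instances at weight 1 and one randomly chosen instance from the rest at an amplified weight, so split finding only has to look at three instances instead of ten.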
Good features are the backbone of any machine learning model, and good feature creation often needs domain knowledge, creativity, and lots of time. TL;DR: this post is about useful feature engineering methods and tricks that I have learned and end up using often, along with some other ideas to think about when creating features. Have you read about featuretools yet? If not, then you are going to be delighted.
The 'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using, it's probably going to learn the other two, more balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an 'age' variable for the waterpoints, as that seems highly relevant. The 'population' variable also has a highly right-skewed distribution, so we're going to transform that as well. One of the most important points we learned the week before, and something that will stay with me, is the idea of coming up with a baseline model as fast as one can.
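The 'age' and 'population' steps, plus a trivial baseline, could be sketched roughly as follows. This is only an illustration under assumptions: the column names `construction_year` and `year_recorded`, the convention that a construction year of 0 means "unknown", the choice of a log transform for the skewed variable, and the class labels other than 'functional needs repair' are all my guesses rather than something stated above.

```python
import math
from collections import Counter


def add_features(rows):
    """Add an 'age' column and a log-transformed population column."""
    out = []
    for r in rows:
        new = dict(r)
        built = r["construction_year"]
        # Assumption: 0 marks a missing construction year, so age is unknown.
        new["age"] = r["year_recorded"] - built if built > 0 else None
        # log1p(x) = log(1 + x) compresses the long right tail and handles 0.
        new["log_population"] = math.log1p(r["population"])
        out.append(new)
    return out


def majority_baseline(labels):
    """Fastest possible baseline: always predict the most common class."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical waterpoint rows and labels.
rows = [
    {"construction_year": 2000, "year_recorded": 2013, "population": 99},
    {"construction_year": 0, "year_recorded": 2011, "population": 0},
]
features = add_features(rows)

labels = ["functional", "non functional", "functional", "functional needs repair"]
baseline_prediction = majority_baseline(labels)
```

A majority-class baseline like this never predicts the rare 'functional needs repair' class, which is exactly why it is a useful yardstick: any real model should beat it on that class, not just on overall accuracy.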