I recently participated in this Kaggle competition (WIDS Datathon by Stanford) where I was able to land up in Top 10 using various boosting algorithms. Since then, I have been very curious about the fine workings of each model including parameter tuning, pros and cons and hence decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful in the regime of limited training data, little training time and little expertise for parameter tuning. Since XGBoost (often called GBM Killer) has been in the machine learning world for a longer time now with lots of articles dedicated to it, this post will focus more on CatBoost & LGBM. LightGBM uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for finding a split value while XGBoost uses pre-sorted algorithm & Histogram-based algorithm for computing the best split.
In some following posts, I will explore these other methods, such as Random Forest, Support Vector Modeling, and XGboost, to see if we can improve on this customer churn model! In my previous post, we completed a pretty in-depth walk through of the exploratory data analysis process for a customer churn analysis dataset. Our data, sourced from Kaggle, is centered around customer churn, the rate at which a commercial customer will leave the commercial platform that they are currently a (paying) customer, of a telecommunications company, Telco. Now that the EDA process has been complete, and we have a pretty good sense of what our data tells us before processing, we can move on to building a Logistic Regression classification model which will allow for us to predict whether a customer is at risk to churn from Telco's platform. The complete GitHub repository with notebooks and data walkthrough can be found here.
The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinfomatics. Currently, central clustering of categorical data is a difficult problem due to the lack of a geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. Then, a new algorithm called k-centers is proposed for central clustering of categorical data, incorporating a new feature weighting scheme by which each attribute is automatically assigned with a weight measuring its individual contribution for the clusters. Experimental results on real-world data show outstanding performance of the proposed algorithm, especially in recognizing the biological patterns in DNA sequences.
An average data scientist deal with lots of data daily, around 60–70% time spend on data cleaning, data munging and convert the data into suitable form so that we can apply machine learning model on that data. This blog focuses on applying machine learning models, including the preprocessing steps. Many Data science enthusiast ask me how to solve machine learning problem? Before applying the machine learning models, the data must be converted to a tabular form. There is two types of data Numerical variable and Categorical variable.
Nowadays everyone knows what is customer churn and how to predict and how to handle it. Even the junior data scientists have worked on a part of it (at least). Seniors have already created a prediction model to predict that a customer is going to leave or not. But is that really simple? The answer is NO of course.