categorical feature

CatBoost: unbiased boosting with categorical features

Neural Information Processing Systems

This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
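
As a rough illustration of the leakage problem the abstract refers to, here is a minimal sketch of an ordered target statistic: each row's category is encoded using only the targets of rows that come earlier in a random permutation, so a row never sees its own label. The function name, the smoothing scheme with `prior` and `prior_weight`, and the toy data are assumptions made for this example; this is not CatBoost's actual implementation.

```python
import random


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row's category from targets seen earlier in a random permutation."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # hypothetical: a single random permutation

    sums = {}    # running sum of targets per category value
    counts = {}  # running number of rows seen per category value
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of earlier targets; the row's own target is not used.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does this row's target become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded


# Toy data, invented for illustration.
cities = ["a", "b", "a", "a", "b", "c"]
clicked = [1, 0, 1, 0, 1, 1]
prior = sum(clicked) / len(clicked)
encoded = ordered_target_statistic(cities, clicked, prior)
```

A category that appears only once (here `"c"`) is encoded as the prior alone, because no earlier row with that value exists.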

Feature Selection with sklearn and Pandas


Feature selection is one of the first and most important steps in any machine learning task. In a dataset, a feature is simply a column. When we get a dataset, not every column (feature) necessarily has an impact on the output variable. If we add these irrelevant features to the model, they will just make the model worse (Garbage In, Garbage Out). This gives rise to the need for feature selection.

How do I encode categorical features using scikit-learn?


In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn? In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.
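
The workflow described above might look roughly like the following sketch, which one-hot encodes two categorical columns with `OneHotEncoder` inside a `ColumnTransformer`, wraps the result in a `Pipeline`, and cross-validates preprocessing and model together. The column names, the toy data, and the choice of `LogisticRegression` are assumptions made for this example and do not come from the video.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "free", "pro", "free", "basic", "pro"],
    "region": ["eu", "us", "us", "eu", "eu", "us", "eu", "us"],
    "monthly_usage": [10.0, 42.0, 12.5, 3.0, 39.0, 2.5, 11.0, 45.0],
    "churned": [0, 0, 1, 1, 0, 1, 0, 0],
})
X = df.drop(columns="churned")
y = df["churned"]

# One-hot encode the categorical columns; pass the numeric column through unchanged.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["plan", "region"]),
    remainder="passthrough",
)

# Putting the encoder inside the pipeline means it is refit on each training fold.
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=2, scoring="accuracy")

# Fit the preprocessing step alone to inspect the resulting feature matrix.
encoded = preprocess.fit_transform(X)
```

Because the encoder lives inside the pipeline, categories are learned only from each training fold, and `handle_unknown="ignore"` keeps prediction from failing on a category that was absent during fitting.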

Using Word2Vec for Better Embeddings of Categorical Features


Back in 2012, when neural networks regained popularity, people were excited about the possibility of training models without having to worry about feature engineering. Indeed, most of the earliest breakthroughs were in computer vision, in which raw pixels were used as input for networks. Soon enough it turned out that if you wanted to use textual data, clickstream data, or pretty much any data with categorical features, at some point you'd have to ask yourself -- how do I represent my categorical features as vectors that my network can work with? The most popular approach is embedding layers -- you add an extra layer to your network, which assigns a vector to each value of the categorical feature. During training the network learns the weights for the different layers, including those embeddings.
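
To make the idea of an embedding layer concrete, here is a minimal sketch in plain Python: a table with one vector per category value, a lookup, and a single gradient step on the looked-up row. The class name `EmbeddingTable`, its methods, and the example gradient are hypothetical and chosen for illustration; real frameworks provide their own embedding layers and compute gradients automatically.

```python
import random


class EmbeddingTable:
    """Lookup table mapping each categorical value to a trainable vector."""

    def __init__(self, vocabulary, dim, seed=0):
        rng = random.Random(seed)
        self.index = {value: i for i, value in enumerate(vocabulary)}
        # One row of small random weights per category value.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocabulary
        ]

    def lookup(self, value):
        """Return the current vector for a category value."""
        return self.weights[self.index[value]]

    def sgd_update(self, value, grad, lr=0.1):
        """Apply one gradient step to the row belonging to this value."""
        row = self.weights[self.index[value]]
        for j, g in enumerate(grad):
            row[j] -= lr * g


table = EmbeddingTable(["chrome", "firefox", "safari"], dim=4)
before = list(table.lookup("firefox"))  # copy, so the update is visible
# Hypothetical gradient; in a real network it comes from backpropagation.
table.sgd_update("firefox", grad=[1.0, 0.0, -1.0, 0.0])
after = table.lookup("firefox")
```

Only the row for the value that appeared in the input is updated, which is why rare category values end up with poorly trained vectors.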

Choosing a Machine Learning Model


Ever wonder how we can apply machine learning algorithms to a problem in order to analyze, visualize, discover trends & find correlations within data? In this article, I'm going to discuss common steps for setting up a machine learning model as well as approaches to selecting the right model for your data. This article was inspired by common interview questions about how I approach a data science problem and why I choose a given model. Machine learning tasks can be classified as supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. In this article we don't focus on the last two; however, I'll give some idea of what they are.

Use Deep learning on tabular data by training Entity Embeddings of Categorical Variables. - Chandrasekhar's blog


With the Kaggle Elo Merchant Category Recommendation being my first competition, my expectations weren't sky high, and I'd be very happy if I managed to stand out among the top 10%. I am trailing at 570th of roughly 4,000 data scientists in the competition. I have tried all the ML best practices and tricks known to me. I have done monstrous aggregates of aggregates, a bevy of models (hehehe..) like LGBM, XGBoost, Random Forests, and CatBoost, plus model post-processing, parameter tuning, model blending, ensembling, feature permutation, elimination, recursive feature selection, Boruta, and many more. I've written about this in detail here.



LightGBM is a gradient boosting framework that uses tree-based learning algorithms. For further details, please refer to Features. Benefiting from these advantages, LightGBM is widely used in many winning solutions of machine learning competitions. Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, parallel experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.

Customer churn prediction using Neural Networks with TensorFlow.js Deep Learning for JavaScript Hackers (Part IV) - Adventures in Artificial Intelligence


TL;DR Learn about Deep Learning and create a Deep Neural Network model to predict customer churn using TensorFlow.js. First day! You've landed this Data Scientist intern job at a large telecom company. You can't stop dreaming about the Lambos and designer clothes you're going to get once you're a Senior Data Scientist. Even your mom is calling to remind you to put your Ph.D. in Statistics diploma on the wall. This is the life. Who cares that you're in your mid-30s and this is your first job ever?

Data Preparation for Machine Learning: Cleansing, Transformation & Feature Engineering


The purpose of the Data Preparation stage is to get the data into the best format for machine learning. This includes three stages: Data Cleansing, Data Transformation, and Feature Engineering. Quality data is more important than complicated algorithms, so this is an incredibly important step and should not be skipped. During the Data Understanding activities, you explored your data and detected incomplete or incorrect values. Most machine learning models require all features to be complete; therefore, missing values must be dealt with. The simplest solution is to remove all rows that have a missing value, but important information could be lost or bias introduced.
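
The trade-off described above can be seen in a small sketch that compares dropping incomplete rows with filling the gaps. Mean imputation is used here only as one common alternative; it is an assumption of this example rather than something the article prescribes, and the column names and values are invented.

```python
# Toy records, invented for illustration; None marks a missing value.
rows = [
    {"age": 34.0, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 45.0, "income": None},
    {"age": 29.0, "income": 48000.0},
]


def drop_incomplete(rows):
    """Keep only rows in which every feature is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    means = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        means[col] = sum(present) / len(present)
    return [
        {col: (r[col] if r[col] is not None else means[col]) for col in columns}
        for r in rows
    ]


dropped = drop_incomplete(rows)  # half of the rows are lost
imputed = impute_mean(rows)      # all rows kept, gaps filled with column means
```

Dropping rows discards half of this toy dataset, while imputation keeps every row at the cost of inserting values that were never observed.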

What's so special about CatBoost?


CatBoost is a machine learning technique based on gradient boosting, developed by Yandex, that outperforms many existing boosting algorithms such as XGBoost and LightGBM. While deep learning algorithms require lots of data and computational power, boosting algorithms are still needed for most business problems. However, boosting algorithms like XGBoost can take hours to train, and sometimes you'll get frustrated while tuning hyper-parameters. On the other hand, CatBoost is easy to implement and very powerful.