 columntransformer


How to Improve Machine Learning Code Quality with Scikit-learn Pipeline and ColumnTransformer

#artificialintelligence

When you're working on a machine learning project, the most tedious steps are often data cleaning and preprocessing. Especially when you're working in a Jupyter Notebook, running code across many cells can get confusing. The Scikit-learn library has two tools, Pipeline and ColumnTransformer, that make this work easier. Instead of transforming the DataFrame step by step, a pipeline chains all transformation steps into a single object, so you get the same result with less code.
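As a rough illustration of the idea, the sketch below scales a numeric column and one-hot encodes a categorical column with a ColumnTransformer, then chains that step with a classifier in a Pipeline. The toy DataFrame, its column names, and the choice of LogisticRegression are assumptions made up for this example, not taken from the article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: the column names and values are hypothetical.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "city": ["Oslo", "Rome", "Oslo", "Lima", "Rome", "Lima"],
    "bought": [0, 1, 1, 0, 0, 1],
})
X, y = df[["age", "city"]], df["bought"]

# Apply a different transformer to each group of columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing and the estimator into one object.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

# A single fit call runs every preprocessing step, then trains the classifier.
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on new rows applies the same fitted scaler and encoder without any extra cells.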


Advanced Pipelines with scikit-learn

#artificialintelligence

Figure 1 shows what we would like to have at the end of this article. In the following, we will implement each of these steps. In step 5, we apply hyperparameter optimization and create a feature importance plot. EDA, feature building, maximizing the model's performance, analyzing and interpreting the outcome are not in the scope of this article. The goal is to show you how to work with a pipeline that integrates modules from different packages.
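A minimal sketch of tuning a pipeline and reading feature importances might look like the following. It uses only scikit-learn, whereas the article combines modules from several packages; the synthetic data, the random forest, and the small parameter grid are all assumptions chosen to keep the example short, and the importances are printed rather than plotted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__n_estimators": [10, 30],
    "clf__max_depth": [2, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

# The refitted best pipeline exposes its steps by name.
best_forest = search.best_estimator_.named_steps["clf"]
importances = best_forest.feature_importances_

print(search.best_params_)
for index, importance in enumerate(importances):
    print(f"feature {index}: {importance:.3f}")
```

Searching over the whole pipeline, rather than over the estimator alone, means the scaler is refitted inside each cross-validation fold, which avoids leaking information from the validation split.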


OneHotEncoder in one go

#artificialintelligence

Suppose we are beginners in machine learning and are excited to feed our dataset into a machine learning algorithm. But then we discover that the algorithm can process only numerical data, and our dataset contains non-numeric (string) values. So how can we feed this non-numeric data into the algorithm? This is the stage where OneHotEncoder can help us.


Are you using Pipeline in Scikit-Learn?

#artificialintelligence

If you are doing machine learning, you will have come across pipelines, as they help you build a workflow that is easy to understand and reproducible. In case you are not familiar with pipelines, you can refer to the excellent blog posts by Rebecca Vickery, "A Simple Guide to Scikit-learn Pipelines", and Saptashwa Bhattacharyya, "A Simple Example of Pipeline in Machine Learning with Scikit-learn". Let's see how it can be done. To demonstrate, I am going to use the Titanic dataset from OpenML to walk through how you can create a data pipeline. I am going to use a subset of features for demo purposes.


Imbalanced Classification with the Adult Income Dataset

#artificialintelligence

Many binary classification tasks do not have an equal number of examples from each class, i.e. the class distribution is skewed or imbalanced. A popular example is the adult income dataset, which involves predicting personal income levels as above or below $50,000 per year based on personal details such as relationship and education level. There are many more cases of incomes below $50K than above $50K, although the skew is not severe. This means that techniques for imbalanced classification can be used, while model performance can still be reported using classification accuracy, as with balanced classification problems. In this tutorial, you will discover how to develop and evaluate a model for the imbalanced adult income classification dataset.
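One way to evaluate a model on mildly imbalanced data is to use stratified cross-validation and compare accuracy against a majority-class baseline. The sketch below does this on synthetic data with a roughly 3:1 class ratio; it does not load the adult income dataset, and the data generator settings, the logistic regression model, and the fold counts are assumptions for illustration only.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in with a mild skew, about 75% negative and 25% positive.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    class_sep=1.5,
    weights=[0.75, 0.25],
    random_state=1,
)
class_counts = Counter(y)
print(class_counts)

# Stratified folds keep the class ratio roughly the same in every split.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

# Baseline: always predict the majority class.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, scoring="accuracy", cv=cv
)

# Candidate model evaluated on the same folds.
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv
)

print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"model accuracy:    {model_scores.mean():.3f}")
```

The baseline matters because, with a skewed class distribution, a model that always predicts the majority class already scores well on accuracy; a useful model has to beat that number.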