The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. In this article, with simple yet effective examples we will explain how to deal with categorical data in computing machine learning algorithms and how we to map ordinal and nominal feature values to integer representations. The article is an excerpt from the book Python Machine Learning – Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python.
In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn? In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.
XGBoost is a popular implementation of Gradient Boosting because of its speed and performance. Internally, XGBoost models represent all problems as a regression predictive modeling problem that only takes numerical values as input. If your data is in a different form, it must be prepared into the expected format. In this post you will discover how to prepare your data for using with gradient boosting with the XGBoost library in Python. Data Preparation for Gradient Boosting with XGBoost in Python Photo by Ed Dunens, some rights reserved.
Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are an integer encoding and a one hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods. In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras. How to Encode Categorical Data for Deep Learning in Keras Photo by Ken Dixon, some rights reserved. A categorical variable is a variable whose values take on the value of labels.