Data preprocessing is the first (and arguably most important) step toward building a working machine learning model. If your data hasn't been cleaned and preprocessed, your model does not work. Data preprocessing is generally thought of as the boring part. But it's the difference between being prepared and being completely unprepared. You might not like the preparation part, but tightening down the details in advance can save you from one nightmare of a trip.
XGBoost is a popular implementation of Gradient Boosting because of its speed and performance. Internally, XGBoost models represent all problems as a regression predictive modeling problem that only takes numerical values as input. If your data is in a different form, it must be prepared into the expected format. In this post you will discover how to prepare your data for using with gradient boosting with the XGBoost library in Python. Data Preparation for Gradient Boosting with XGBoost in Python Photo by Ed Dunens, some rights reserved.
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. The absolutely first thing you need to do is to import libraries for data preprocessing. There are lots of libraries available, but the most popular and important Python libraries for working on data are Numpy, Matplotlib, and Pandas. Numpy is the library used for all mathematical things. Pandas is the best tool available for importing and managing datasets.
Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are an integer encoding and a one hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods. In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras. How to Encode Categorical Data for Deep Learning in Keras Photo by Ken Dixon, some rights reserved. A categorical variable is a variable whose values take on the value of labels.
Categorical data are commonplace in many Data Science and Machine Learning problems but are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms. One of the most common ways to make this transformation is to one-hot encode the categorical features, especially when there does not exist a natural ordering between the categories (e.g. a feature'City' with names of cities such as'London', 'Lisbon', 'Berlin', etc.). Even though this type of encoding is used very frequently, it can be frustrating to try to implement it using scikit-learn in Python, as there isn't currently a simple transformer to apply, especially if you want to use it as a step of your machine learning pipeline. In this post, I'm going to describe how you can still implement it using only scikit-learn and pandas (but with a bit of effort).