Feature selection is the process of reducing the number of input variables when developing a predictive model. Reducing the number of input variables lowers the computational cost of modeling and, in some cases, improves the performance of the model. Filter-based feature selection methods use statistics to evaluate the relationship between each input variable and the target variable, and select the input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measure depends on the data type of both the input and output variables. As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.
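To make the idea concrete, here is a minimal sketch of filter-based selection with scikit-learn. It assumes numerical inputs and a categorical target, for which the ANOVA F-statistic (`f_classif`) is one reasonable choice; the synthetic dataset and the value `k=3` are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 10 numerical inputs,
# 3 of them informative, and a binary class label as the target.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=1
)

# Score every input variable against the target with the ANOVA F-statistic,
# then keep the k inputs with the highest scores.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # 200 rows, 3 selected columns
print(selector.get_support(indices=True))  # column indices that were kept
```

With a different combination of data types the scoring function would change. For example, `f_regression` is a common choice for a numerical target, and `chi2` or `mutual_info_classif` for categorical inputs with a categorical target.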
These days there is a Cambrian explosion of data science and machine learning tools that make it very easy to get started in machine learning. Perhaps you have heard the buzzword and want to try it out yourself. Maybe you have gone through tutorials on one of the popular machine learning libraries, such as scikit-learn, and want an idea of how to implement machine learning. You recognize that your problem has all the prerequisites that make it suitable for machine learning: you have a data set and a problem that seems to have a pattern to it, but you cannot pin that pattern down with an explicit algorithm.
First, I will import the required libraries and read the data into a pandas DataFrame. The original dataset contains 81 columns, but for the purposes of this tutorial, I will work with a subset of 12 columns. Details about the columns can be found in the provided link to the dataset. Please note that each row of the table represents a specific house (or observation).
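As a minimal sketch of this loading step, the snippet below reads only a chosen subset of columns into a DataFrame. The CSV text, the column names, and the values are all made up for illustration and are not the real dataset. In practice you would pass the path of the downloaded file to `pd.read_csv` and list the 12 columns of interest.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: hypothetical column names and values.
csv_text = """house_id,lot_area,year_built,garage_cars,sale_price
1,8000,2001,2,200000
2,9500,1975,1,150000
3,12000,1990,2,230000
"""

# Hypothetical subset of columns to keep; the rest are never loaded.
columns_to_keep = ["lot_area", "year_built", "sale_price"]

# usecols restricts parsing to the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=columns_to_keep)

print(df.shape)   # (number of houses, number of selected columns)
print(df.head())  # each row is one house (observation)
```

Selecting the columns at read time with `usecols` avoids loading the unused columns at all, which is convenient when the full table is wide.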
Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation. The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as on the algorithms that will be used to model it, which may impose expectations or requirements on the data. Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data. These data preparation algorithms can be grouped by type into a framework that is helpful when comparing and selecting techniques for a specific project. In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.
There is a vast number of different data preparation techniques that could be used on a predictive modeling project. In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high dimensionality of the data, the ever-increasing parade of new machine learning algorithms, and the very human limitations of the practitioner. Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know which data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike. The solution is to think about the vast field of data preparation in a structured way and to systematically evaluate data preparation techniques based on their effect on the raw data.
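One way to picture data preparation as a tunable hyperparameter is to make the preparation step of a scikit-learn `Pipeline` part of a grid search. The sketch below is only an illustration under stated assumptions: the synthetic dataset, the logistic regression model, and the three candidate preparation options are all choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# The "prep" step is a placeholder that the grid search will swap out.
pipeline = Pipeline([
    ("prep", "passthrough"),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate data preparation choices: none, standardization, normalization.
param_grid = {"prep": ["passthrough", StandardScaler(), MinMaxScaler()]}

# Each candidate is fit inside every cross-validation fold, so the
# preparation never sees the held-out data.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_["prep"])   # preparation step with the best mean score
print(round(search.best_score_, 3))  # mean cross-validated accuracy
```

Because the preparation lives inside the pipeline, it is evaluated under exactly the same cross-validation procedure as the model, and more candidates can be added to the grid as the search widens.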