Goto

Collaborating Authors

Tour of Data Preparation Techniques for Machine Learning

#artificialintelligence

Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation. The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as the algorithms that will be used to model them that may impose expectations or requirements on the data. Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project. In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.


Why Data Preparation Is So Important in Machine Learning

#artificialintelligence

On a predictive modeling project, machine learning algorithms learn a mapping from input variables to a target variable. The most common form of predictive modeling project involves so-called structured data or tabular data. This is data as it looks in a spreadsheet or a matrix, with rows of examples and columns of features for each example. We cannot fit and evaluate machine learning algorithms on raw data; instead, we must transform the data to meet the requirements of individual machine learning algorithms. More than that, we must choose a representation for the data that best exposes the unknown underlying structure of the prediction problem to the learning algorithms in order to get the best performance given our available resources on a predictive modeling project. Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine.


Framework for Data Preparation Techniques in Machine Learning

#artificialintelligence

There are a vast number of different types of data preparation techniques that could be used on a predictive modeling project. In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high-dimensionality of the data, the ever-increasing parade of new machine learning algorithms and limited, although human, limitations of the practitioner. Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know what data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike. The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.


Framework for Data Preparation Techniques in Machine Learning

#artificialintelligence

There are a vast number of different types of data preparation techniques that could be used on a predictive modeling project. In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high-dimensionality of the data, the ever-increasing parade of new machine learning algorithms and limited, although human, limitations of the practitioner. Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know what data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike. The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.


How to Use Feature Extraction on Tabular Data for Machine Learning

#artificialintelligence

Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling. The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise. An alternative approach to data preparation is to apply a suite of common and commonly useful data preparation techniques to the raw data in parallel and combine the results of all of the transforms together into a single large dataset from which a model can be fit and evaluated. This is an alternative philosophy for data preparation that treats data transforms as an approach to extract salient features from raw data to expose the structure of the problem to the learning algorithms.