Machine learning algorithms have gained fame for being able to ferret out relevant information from datasets with many features, such as tables with dozens of rows and images with millions of pixels. Thanks to advances in cloud computing, you can often run very large machine learning models without noticing how much computational power works behind the scenes. But every new feature that you add to your problem adds to its complexity, making it harder to solve it with machine learning algorithms. Data scientists use dimensionality reduction, a set of techniques that remove excessive and irrelevant features from their machine learning models. Dimensionality reduction slashes the costs of machine learning and sometimes makes it possible to solve complicated problems with simpler models. Machine learning models map features to outcomes.
Have you ever worked on a dataset with more than a thousand features? I have, and let me tell you it's a very challenging task, especially if you don't know where to start! Having a high number of variables is both a boon and a curse. It's great that we have loads of data for analysis, but it is challenging due to size. It's not feasible to analyze each and every variable at a microscopic level. It might take us days or months to perform any meaningful analysis and we'll lose a ton of time and money for our business! Not to mention the amount of computational power this will take. We need a better way to deal with high dimensional data so that we can quickly extract patterns and insights from it. So how do we approach such a dataset?
Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also be used as a dimensionality reduction technique, providing a projection of a training dataset that best separates the examples by their assigned class. The ability to use Linear Discriminant Analysis for dimensionality reduction often surprises most practitioners.
Just as the name implies, data science is a branch of science that applies the scientific method to data with the goal of studying the relationships between different features and drawing out meaningful conclusions based on these relationships. Data is, therefore, the key component in data science. A dataset is a particular instance of data that is used for analysis or model building at any given time. A dataset comes in different flavors such as numerical data, categorical data, text data, image data, voice data, and video data. A dataset could be static (not changing) or dynamic (changes with time, for example, stock prices).
Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation. The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as the algorithms that will be used to model them that may impose expectations or requirements on the data. Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project. In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.