Feature selection is the process of reducing the number of input variables when developing a predictive model. Reducing the number of input variables lowers the computational cost of modeling and, in some cases, improves the performance of the model. Filter-based feature selection methods evaluate the relationship between each input variable and the target variable using a statistical measure, then keep the input variables that have the strongest relationship with the target. These methods can be fast and effective, although the choice of statistical measure depends on the data types of both the input and output variables. As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.
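To make the idea concrete, here is a minimal sketch of filter-based selection for the case where both the inputs and the target are numerical. It assumes Pearson correlation as the statistical measure, which is only one of several possible choices, and it uses a small made-up dataset; the names `pearson`, `select_k_best`, `features` and `target` are hypothetical and exist only for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / math.sqrt(var_x * var_y)


def select_k_best(features, target, k):
    """Score each feature by |correlation| with the target and keep the top k."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Made-up toy data (an assumption for illustration, not taken from a real dataset).
features = {
    "size": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "noise": [3, 1, 4, 1, 5],
}
target = [150, 180, 240, 310, 360]

selected, scores = select_k_best(features, target, k=2)
print(selected)
```

Each feature is scored independently of the others, which is what makes this style of selection fast. A different pairing of data types (for example, categorical inputs with a categorical target) would call for a different statistical measure than the correlation used here.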
These days there is a Cambrian explosion of data science and machine learning tools that make it very easy to get started in machine learning. Perhaps you have heard the buzzword and want to try it out yourself. Maybe you have gone through tutorials for one of the popular machine learning libraries, such as scikit-learn, and want an idea of how to apply machine learning in practice. You recognize that your problem has all the prerequisites that make it suitable for machine learning: you have a data set, and the problem seems to have a pattern to it, but you cannot pin that pattern down with a hand-written algorithm.
Data comes in various forms but can be classified into two main groups: structured and unstructured. Structured data has a high degree of organization, such as numerical or categorical data. Temperature, phone numbers, and gender are examples of structured data. Unstructured data does not explicitly have the structure we are used to. Examples of unstructured data include photos, audio, language text, and many others. There is an emerging field called deep learning, which uses a specialized set of algorithms that perform well on unstructured data. In this guide we are going to focus on structured data, but provide brief information on the relevant topics in deep learning. The two types of structured data we commonly deal with are categorical variables (which take a finite set of values) and numerical variables (which are typically continuous).
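As a rough illustration of the categorical versus numerical distinction, the sketch below labels each column of a small made-up table. The rule it uses (a column is numerical only if every value is an int or float) is a simplifying assumption, and the names `describe_column` and `table` are hypothetical. Note that a heuristic like this cannot tell that digits such as phone numbers are identifiers rather than quantities, so in practice the decision still needs human judgment.

```python
def describe_column(values):
    """Label a column as numerical or categorical using a simple type check."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        # Numerical: summarize by its range.
        return {"kind": "numerical", "min": min(values), "max": max(values)}
    # Categorical: summarize by its finite set of distinct values.
    return {"kind": "categorical", "levels": sorted(set(values))}


# Made-up structured data (an assumption for illustration).
table = {
    "temperature": [21.5, 19.0, 23.2, 20.1],
    "gender": ["F", "M", "F", "F"],
}

summary = {name: describe_column(column) for name, column in table.items()}
for name, info in summary.items():
    print(name, info)
```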
"Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed" Arthur Smauel. We can think of machine learning as approach to automate tasks like predictions or modelling. For example, consider an email spam filter system, instead of having programmers manually looking at the emails and coming up with spam rules. We can use a machine learning algorithm and feed it input data (emails) and it will automatically discover rules that are powerful enough to distinguish spam emails. Machine learning is used in many application nowadays like spam detection in emails or movie recommendation systems that tells you movies that you might like based on your viewing history.