Python Data Preparation Case Files: Removing Instances & Basic Imputation


Data preparation covers a lot of potential ground: data integration, data transformation, feature selection, feature engineering, and much more. Handling missing values is one part of it, and one approach is to delete the data instances that contain them. There are, of course, varying implementations of this approach: delete all instances with any number of missing values; delete all instances with 2 or more missing values; delete only those instances missing a particular feature's value. This post deals with a first set of data preprocessing tasks, specifically dropping some instances and performing some basic imputation. A pair of follow-up posts will demonstrate imputing a value based on the category membership of a different variable (such as using the mean salary of everyone living in Washington state to fill in the missing salary values of Washington state residents) and performing imputation by regression (such as fitting a linear regression model on a combination of variables and using the resulting model to predict the missing values of a different variable).
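
As a rough preview, the three deletion variants and a basic mean imputation might look like the sketch below in pandas. The toy DataFrame, its column names (state, age, salary), and its values are all made up for illustration and are not the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "state": ["WA", "WA", "OR", None, "CA"],
    "age": [34, np.nan, 51, np.nan, 29],
    "salary": [72000.0, 65000.0, np.nan, np.nan, 58000.0],
})

# 1. Delete all instances with any number of missing values.
drop_any = df.dropna()

# 2. Delete all instances with 2 or more missing values
#    (keep rows whose count of missing values is below 2).
drop_two_plus = df[df.isna().sum(axis=1) < 2]

# 3. Delete only instances missing a particular feature's value (here, salary).
drop_missing_salary = df.dropna(subset=["salary"])

# Basic imputation: replace missing salaries with the mean of the observed ones.
imputed = df.copy()
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].mean())
```

Which variant is appropriate depends on how much data can be spared: the first is the strictest and discards the most rows, while the third discards only the rows that are unusable for the feature in question.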