Machine Learning -- Perfection always starts with mistakes


It is almost a truism that 'dirty data' is one of the biggest barriers Data Scientists face. Data Cleansing is the most time-consuming part of an ML project, taking 60% of the overall time, and it is preceded by Data Ingestion, which takes another 20%: a remarkable total of 80% is spent in the initial phase of the project! Treating missing values is one of the most important tasks of data cleansing, and it is one where mistakes are easy to make. We need to examine the columns with missing values and see how they relate to the rest of the data set, especially the target values. A common technique is to fill the gaps with the mean, median, or mode of the existing values, but that may not be the right metric for a given column (the mean, for example, is pulled towards outliers), in which case we need to come up with something else. Additionally, when it comes to classification we need to consider the class structure of the data set: we can introduce a new 'Undefined' category for the missing entries, or we can use an ML algorithm to predict the missing value.
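
As a rough illustration of the simpler options mentioned above, here is a minimal sketch in plain Python. The helper names (`impute_numeric`, `impute_category`) and the toy `ages` and `cities` columns are made up for this example, and missing values are assumed to be represented as `None`. In a real project you would more likely reach for a data-frame or ML library.

```python
from statistics import mean, median, mode

# Hypothetical helper: fill missing numeric entries with a summary statistic.
def impute_numeric(values, strategy="median"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical helper: mark missing categorical entries with an explicit label.
def impute_category(values, placeholder="Undefined"):
    """Replace None entries with a dedicated placeholder category."""
    return [placeholder if v is None else v for v in values]

# Toy data (invented for illustration); 120 is a deliberate outlier.
ages = [22, None, 35, 29, None, 120]
cities = ["Athens", None, "Paris", "Athens", None, "Athens"]

ages_median = impute_numeric(ages, "median")
ages_mean = impute_numeric(ages, "mean")
cities_filled = impute_category(cities)

print(ages_median)
print(ages_mean)
print(cities_filled)
```

On this toy column the single outlier drags the mean well above the median, which is one reason to look at the distribution of a column before picking the fill value. Predicting the missing value with an ML model follows the same pattern, with the fill computed per row from the other columns instead of once per column.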