Class Imbalance in Credit Card Fraud Detection - Part 3: Undersampling in Spark

When we do classification in machine learning, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, because it sees enough examples of every class. For example, when training a spam filter, we should have a good amount of data for both spam and non-spam emails. Such an even distribution is not always possible. Let's take fraud detection as an example.