Upsampling Minority Classes in Imbalanced Text Classification Problems Using Markov Chains
Classification problems in supervised machine learning are often troubled by imbalanced class sizes. In binary classification, a heavily skewed class distribution tends to bias a fitted model toward the majority class. A model trained on 1,000 samples labeled class "0" and 100 samples labeled class "1" could naively predict class "0" for every test instance and, on test data with the same class ratio, report roughly 91% accuracy (1,000 of 1,100 correct). Such an accuracy score is deceptive: the model has not learned anything that distinguishes the two classes, and it never identifies a single minority-class instance. This can cause serious problems in deployment, where the minority class is often the one of interest.
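The majority-class baseline described above can be checked with a few lines of Python. This is a minimal sketch, not the article's method: the label lists and the helper functions `accuracy` and `recall` are hypothetical names introduced here for illustration, and the test set is assumed to have the same 1,000-to-100 class ratio as the training data.

```python
# Hypothetical labels mirroring the example: 1,000 samples of class 0
# and 100 samples of class 1 (assumed ratio, for illustration only).
y_true = [0] * 1000 + [1] * 100

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual `positive` samples that were predicted as such."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy:        {acc:.3f}")
print(f"minority recall: {minority_recall:.3f}")
```

Accuracy comes out high because it is dominated by the majority class, while recall on class "1" is zero. Reporting a per-class metric alongside accuracy is one simple way to expose this failure.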
Aug-22-2020, 18:56:13 GMT