What data scientists keep missing about imbalanced datasets
Many data scientists fail to fully understand the problems that imbalanced datasets cause and the methods available to alleviate them. As data scientists, we come across many datasets in which some types of data instances clearly dominate (the majority classes) while others are significantly underrepresented (the minority classes). This has significant implications for the practice of data science: simply training a model on such a dataset will likely bias it towards the majority classes. For example, if we were focussed on predicting heart disease and had a dataset of 20 people with the disease and 80 without, a model that predicts no disease every time would achieve a seemingly solid accuracy of 80% and an F1-score of about 89% when "no disease" is treated as the positive class, while failing to identify a single patient with the disease. Despite this well-known problem, there are too many cases where data scientists have ignored the issue and simply trained a model without a real understanding of the imbalances within the dataset.
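The arithmetic behind the heart disease example can be checked with a few lines of Python. This is only a minimal sketch: the label encoding (1 for disease, 0 for no disease) and the helper name `f1_for_class` are assumptions made for illustration, not something taken from the article. It also shows that the F1-score depends heavily on which class is treated as positive.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score with `positive` treated as the positive class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed encoding: 1 = disease, 0 = no disease.
y_true = [1] * 20 + [0] * 80   # 20 people with the disease, 80 without
y_pred = [0] * 100             # a "model" that always predicts no disease

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_no_disease = f1_for_class(y_true, y_pred, positive=0)
f1_disease = f1_for_class(y_true, y_pred, positive=1)

print(f"accuracy:        {accuracy:.2f}")       # 0.80
print(f"F1 (no disease): {f1_no_disease:.2f}")  # 0.89
print(f"F1 (disease):    {f1_disease:.2f}")     # 0.00
```

Scored on the minority class, the same predictions get an F1 of zero, which is why reporting per-class metrics is usually more informative than a single headline number on imbalanced data.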
Aug-2-2022, 15:20:06 GMT