Comparing Approaches for Combining Data Sampling and Feature Selection to Address Key Data Quality Issues in Tweet Sentiment Analysis

Prusa, Joseph D. (Florida Atlantic University) | Khoshgoftaar, Taghi M. (Florida Atlantic University)

AAAI Conferences 

When training tweet sentiment classifiers, many data quality challenges must be addressed. One potential issue is class imbalance, where most instances belong to a single majority class. This can hurt classifier performance, because classifiers trained on imbalanced data may favor assigning new, unseen instances to the majority class. A second challenge is high dimensionality, since very large numbers of text-based features are used to describe tweet datasets. For datasets where both challenges are present, feature selection and data sampling can be combined to address both high dimensionality and class imbalance. However, there are three potential approaches for combining data sampling and feature selection, and it is unclear which is optimal. In this paper, we seek to determine whether there is a best approach for combining the two. We conduct tests using random undersampling with two post-sampling class ratios (50:50 and 35:65) combined with three feature rankers. Classifiers are trained with each potential combination approach using seven different learners on two datasets. We find that, overall, classifiers trained by performing feature selection followed by data sampling perform better than those trained with the other two approaches; however, the differences are significant only for the more imbalanced dataset.
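To make the combination approaches concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract names only "feature selection followed by data sampling"; the other two orderings shown here (sampling first, then ranking features on the sampled data, and training either on the sampled data or on the original unsampled data) are an assumption about what the three approaches are. The mean-difference ranker, all function names, and the synthetic data are hypothetical stand-ins for the paper's three feature rankers and tweet datasets.

```python
import random

def undersample(X, y, minority_share, rng):
    """Randomly drop majority-class rows until the minority class makes up
    roughly `minority_share` of the data (0.5 -> 50:50, 0.35 -> 35:65)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n_major = round(len(minority) * (1 - minority_share) / minority_share)
    n_major = min(n_major, len(majority))
    keep = sorted(minority + rng.sample(majority, n_major))
    return [X[i] for i in keep], [y[i] for i in keep]

def rank_features(X, y, k):
    """Toy ranker (assumption): score each feature by the absolute difference
    between its class means and return the indices of the top k features."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]

def fs_then_sampling(X, y, k, share, rng):
    # Rank features on the full imbalanced data, then undersample.
    cols = rank_features(X, y, k)
    Xs, ys = undersample(project(X, cols), y, share, rng)
    return Xs, ys, cols

def sampling_then_fs_sampled(X, y, k, share, rng):
    # Undersample, rank on the sampled data, train on the sampled data.
    Xs, ys = undersample(X, y, share, rng)
    cols = rank_features(Xs, ys, k)
    return project(Xs, cols), ys, cols

def sampling_then_fs_full(X, y, k, share, rng):
    # Undersample only to choose features, then train on all original rows.
    Xs, ys = undersample(X, y, share, rng)
    cols = rank_features(Xs, ys, k)
    return project(X, cols), list(y), cols

# Synthetic imbalanced data: 20 minority rows, 80 majority rows, 5 features.
# Feature 0 is shifted for the minority class so it carries some signal.
data_rng = random.Random(0)
y = [1] * 20 + [0] * 80
X = [[data_rng.random() + (0.5 * label if j == 0 else 0.0) for j in range(5)]
     for label in y]

for build in (fs_then_sampling, sampling_then_fs_sampled, sampling_then_fs_full):
    X_train, y_train, cols = build(X, y, 2, 0.5, random.Random(1))
    print(build.__name__, len(X_train), sum(y_train), cols)
```

Each function returns the training rows, their labels, and the selected feature indices, so any learner can be fit on the result and the three pipelines compared under the same evaluation protocol.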
