Robustness of Threshold-Based Feature Rankers with Data Sampling on Noisy and Imbalanced Data
Shanab, Ahmad Abu (Florida Atlantic University) | Khoshgoftaar, Taghi M. (Florida Atlantic University) | Wald, Randall (Florida Atlantic University)
Gene selection has become a vital component of the learning process when using high-dimensional gene expression data. Although extensive research has evaluated the performance of classifiers trained with the selected features, the stability of feature ranking techniques has received relatively little study. This work evaluates the robustness of eleven threshold-based feature selection techniques, examining the impact of data sampling and class noise on the stability of feature selection. To assess robustness, we use four groups of gene expression datasets, employ eleven threshold-based feature rankers, and inject artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, MI and Dev show the best stability on average, while GI and PR show the least. The results also show that balancing datasets through data sampling has, on average, no positive impact on the stability of feature ranking techniques applied to those datasets. In addition, larger feature subsets improve stability, but do so reliably only for noisy datasets.
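The abstract does not say how stability is computed, so the following is only a minimal sketch of the general idea under stated assumptions. It ranks features on a clean, imbalanced dataset, re-ranks them after flipping a fraction of class labels, and reports the average overlap of the two top-k subsets. The ranker (absolute difference of class means), the Jaccard overlap measure, the synthetic data, and all function names are hypothetical stand-ins, not the eleven threshold-based rankers or the metric used in the paper.

```python
import random


def make_data(n_pos, n_neg, n_features, n_informative, rng):
    """Synthetic imbalanced data: the first n_informative features are shifted for class 1."""
    X, y = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += 2.0
            X.append(row)
            y.append(label)
    return X, y


def rank_features(X, y, k):
    """Stand-in ranker (an assumption): score = |mean of class 1 - mean of class 0|."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:
            raise ValueError("both classes must be present")
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def inject_class_noise(y, rate, rng):
    """Flip the labels of a randomly chosen fraction `rate` of the instances."""
    noisy = list(y)
    n_flip = int(round(rate * len(y)))
    for i in rng.sample(range(len(y)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def jaccard(a, b):
    """Overlap of two feature subsets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def stability(X, y, k, noise_rate, runs, seed=0):
    """Mean Jaccard overlap between the clean top-k subset and noisy top-k subsets."""
    rng = random.Random(seed)
    clean = rank_features(X, y, k)
    sims = []
    for _ in range(runs):
        noisy_y = inject_class_noise(y, noise_rate, rng)
        sims.append(jaccard(clean, rank_features(X, noisy_y, k)))
    return sum(sims) / len(sims)


if __name__ == "__main__":
    X, y = make_data(n_pos=20, n_neg=80, n_features=50, n_informative=5,
                     rng=random.Random(1))
    for k in (5, 10, 20):
        print(k, stability(X, y, k=k, noise_rate=0.1, runs=10))
```

Varying `k` and `noise_rate` in this toy setup gives a rough feel for how subset size and class noise interact. A data-sampling step (for example, undersampling the majority class before ranking) could be added before `rank_features` to mimic the sampling comparison described above.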