Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction
Nazeeh Ghatasheh, Ismail Altaharwa, Khaled Aldebei
arXiv.org Artificial Intelligence
Spam on online social networks has recently attracted attention in both research and business, and Twitter has become the preferred medium for spreading spam content. Many research efforts have attempted to counter social network spam. Twitter poses additional challenges: a large feature space and imbalanced data distributions. Related work usually addresses only some of these challenges or produces black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyperparameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting (XGBoost) classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using 10-fold stratified cross-validation repeated 50 times and analyzed using nonparametric statistical tests. The resulting prediction model attains, on average, a geometric mean of 82.32% and an accuracy of 92.67% while using less than 10% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the $\chi^2$ and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared with related work.
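The abstract reports both accuracy and geometric mean because the two can diverge on imbalanced data. The following is a minimal sketch of that effect, assuming the common definition of geometric mean as the square root of sensitivity times specificity (the abstract does not spell out the formula). The toy labels and the function names are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Fraction of all samples that are labelled correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)


def geometric_mean(y_true, y_pred):
    """sqrt(sensitivity * specificity); 0.0 if either class is absent.

    Assumed definition: the usual G-mean for imbalanced binary problems.
    """
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        return 0.0
    sensitivity = tp / (tp + fn)  # recall on the spam class
    specificity = tn / (tn + fp)  # recall on the legitimate class
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 10 spam (1) and 90 legitimate (0) messages.
y_true = [1] * 10 + [0] * 90
# A classifier that catches 5 of the 10 spam messages and wrongly
# flags 2 of the 90 legitimate ones.
y_pred = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 88

acc = accuracy(y_true, y_pred)
gmean = geometric_mean(y_true, y_pred)
print(f"accuracy={acc:.4f} geometric_mean={gmean:.4f}")
```

In this toy case the classifier misses half of the spam, yet accuracy stays high because the majority class dominates the count; the geometric mean drops because it weights both classes equally. That is one reason to read the two reported figures together.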
Oct-30-2023
- Country:
  - Asia
    - Middle East
      - Jordan
        - Amman Governorate > Amman (0.04)
        - Aqaba Governorate > Aqaba (0.04)
      - Republic of Türkiye > Bingoel Province
        - Bingol (0.04)
    - Philippines (0.04)
    - Singapore (0.04)
    - Taiwan (0.04)
  - Europe > Italy (0.04)
  - North America > United States
    - California > Alameda County
      - Berkeley (0.04)
    - Massachusetts > Suffolk County
      - Boston (0.04)
    - New York > New York County
      - New York City (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (1.00)
- Industry:
  - Energy (1.00)
  - Health & Medicine (1.00)
  - Information Technology
    - Security & Privacy (1.00)
    - Services (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning
      - Ensemble Learning (1.00)
      - Evolutionary Systems (1.00)
      - Neural Networks > Deep Learning (1.00)
      - Performance Analysis > Accuracy (1.00)
      - Statistical Learning (1.00)
    - Representation & Reasoning > Optimization (1.00)