Improving statistical learning methods via features selection without replacement sampling and random projection

Khan, Sulaiman, Ahmad, Muhammad, Ullah, Fida, Ibañez, Carlos Aguilar, Rodriguez, José Eduardo Valdez

Jun-3-2025–arXiv.org Machine Learning

Cancer is fundamentally a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression, leading to uncontrolled cell growth and metastasis. High-dimensional microarray datasets pose challenges for classification models because of the "small n, large p" problem (far fewer samples than features), which leads to overfitting. This study makes three key contributions: 1) We propose a machine learning-based approach that integrates the Feature Selection Without Replacement (FSWOR) technique with a projection method to improve classification accuracy. 2) We apply the Kendall statistical test to identify the most significant genes in the brain cancer microarray dataset (GSE50161), reducing the feature space from 54,675 to 20,890 genes. 3) We evaluate machine learning models with k-fold cross-validation; our model combines ensemble classifiers with LDA projection and Naïve Bayes and achieves a test score of 96%, outperforming existing methods by 9.09%. The results demonstrate the effectiveness of our approach in high-dimensional gene expression analysis, improving classification accuracy while mitigating overfitting. This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
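
The filtering step described in the abstract can be illustrated with a small sketch. The abstract gives neither the exact Kendall criterion nor the FSWOR procedure, so everything below is an assumption made for illustration: features are scored by the absolute Kendall tau-b correlation with the class label and kept above a hypothetical threshold, and "without replacement" is read as splitting the kept features into disjoint random subsets (for example, one per ensemble member). The function names `kendall_tau_b`, `select_features`, and `disjoint_feature_subsets` are hypothetical and do not come from the paper.

```python
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairwise scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        if dx == 0:
            ties_x += 1  # tied on x only
        elif dy == 0:
            ties_y += 1  # tied on y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def select_features(X, y, threshold):
    """Indices of columns whose |tau-b| with the labels is >= threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    keep = []
    for f in range(len(X[0])):
        column = [row[f] for row in X]
        if abs(kendall_tau_b(column, y)) >= threshold:
            keep.append(f)
    return keep


def disjoint_feature_subsets(indices, n_subsets, seed=0):
    """Split feature indices into disjoint random subsets.

    Assumed reading of "without replacement": each feature lands in exactly
    one subset, so no feature is reused across subsets.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(indices, len(indices))  # a random permutation
    return [sorted(shuffled[k::n_subsets]) for k in range(n_subsets)]


# Toy data: 6 samples, 3 features, binary labels (not from GSE50161).
X = [[1, 5, 0.3], [2, 3, 0.1], [3, 6, 0.4],
     [4, 2, 0.2], [5, 4, 0.5], [6, 1, 0.6]]
y = [0, 0, 0, 1, 1, 1]

selected = select_features(X, y, threshold=0.7)
subsets = disjoint_feature_subsets(list(range(10)), n_subsets=3)
```

On real microarray data the pairwise scan would be slow for tens of thousands of genes; a library routine such as SciPy's `scipy.stats.kendalltau`, which also reports a p-value, would be the more practical choice. Each subset could then be passed to its own projection and classifier, though how the paper combines them is not stated in the abstract.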

accuracy, artificial intelligence, machine learning, (14 more...)

Country:
- South America (0.04)
- Asia > China (0.04)
- North America
  - Central America (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
- Europe > Czechia
  - Prague (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Oncology > Brain Cancer (1.00)
    - Neurology (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning (1.00)
  - Performance Analysis > Accuracy (1.00)
