boruta
BoMGene: Integrating Boruta-mRMR feature selection for enhanced Gene expression classification
Phan, Bich-Chung, Ma, Thanh, Nguyen, Huu-Hoa, Do, Thanh-Nghi
Feature selection is a crucial step in analyzing gene expression data: it enhances classification performance and reduces computational costs for high-dimensional datasets. This paper proposes BoMGene, a hybrid feature selection method that integrates two popular techniques: Boruta and Minimum Redundancy Maximum Relevance (mRMR). The method aims to optimize the feature space and enhance classification accuracy. Experiments were conducted on 25 publicly available gene expression datasets, employing widely used classifiers such as Support Vector Machine (SVM), Random Forest, XGBoost (XGB), and Gradient Boosting Machine (GBM). The results show that the Boruta-mRMR combination selects fewer features than mRMR alone, which shortens training time while maintaining or even improving classification accuracy relative to either feature selection method used on its own. The proposed approach demonstrates clear advantages in accuracy, stability, and practical applicability for multi-class gene expression data analysis.
BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification
Phan, Bich-Chung, Ma, Thanh, Nguyen, Huu-Hoa, Do, Thanh-Nghi
Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily due to the high dimensionality of genomic data and the risk of overfitting. To bridge this gap, we propose BOLIMES, a novel feature selection algorithm designed to enhance gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, we integrate the robustness of Boruta with the interpretability of LIME, ensuring that only the most relevant and influential genes are retained. BOLIMES first employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, thus preserving valuable information. It then uses LIME to rank the remaining genes based on their local importance to the classifier. Finally, an iterative classification evaluation determines the optimal feature subset by selecting the number of genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, our solution effectively balances dimensionality reduction with high classification performance, offering a powerful solution for high-dimensional gene expression analysis.
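The final step described above, choosing how many ranked genes to keep by evaluating the classifier on growing subsets, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `pick_best_prefix` and the `evaluate` callback are hypothetical names, and the ranking is assumed to come from an earlier step such as a LIME-based ordering.

```python
from typing import Callable, List, Sequence, Tuple


def pick_best_prefix(
    ranked_genes: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Return the top-k prefix of `ranked_genes` with the highest score.

    `ranked_genes` is assumed to be sorted from most to least important.
    `evaluate` is a hypothetical callback that trains and scores a
    classifier on the given genes (for example, cross-validated accuracy).
    Ties are broken in favour of the smaller subset.
    """
    if not ranked_genes:
        raise ValueError("ranked_genes must not be empty")
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked_genes) + 1):
        subset = list(ranked_genes[:k])
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the smallest k on ties
            best_subset, best_score = subset, score
    return best_subset, best_score


if __name__ == "__main__":
    # Toy scorer standing in for a real classifier evaluation (assumption):
    # informative genes raise the score, uninformative ones lower it slightly.
    informative = {"g1", "g2", "g3"}

    def toy_accuracy(genes: Sequence[str]) -> float:
        hits = sum(1 for g in genes if g in informative)
        return 0.5 + 0.1 * hits - 0.01 * (len(genes) - hits)

    print(pick_best_prefix(["g1", "g2", "x1", "g3", "x2"], toy_accuracy))
```

Evaluating every prefix costs one model fit per candidate size, so in practice a coarser grid of sizes may be preferable for very long gene lists.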
Noise-Augmented Boruta: The Neural Network Perturbation Infusion with Boruta Feature Selection
Gharoun, Hassan, Yazdanjoe, Navid, Khorshidi, Mohammad Sadegh, Gandomi, Amir H.
With the surge in data generation, both vertically (i.e., volume of data) and horizontally (i.e., dimensionality), the burden of the curse of dimensionality has become increasingly palpable. Feature selection, a key facet of dimensionality reduction techniques, has advanced considerably to address this challenge. One such advancement is the Boruta feature selection algorithm, which successfully discerns meaningful features by contrasting them with their permuted counterparts, known as shadow features. However, the significance of a feature is shaped more by the data's overall traits than by its intrinsic value, a sentiment echoed in the conventional Boruta algorithm, where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper introduces an innovative approach to the Boruta feature selection algorithm by incorporating noise into the shadow variables. Drawing parallels from the perturbation analysis framework of artificial neural networks, this evolved version of the Boruta method is presented. Rigorous testing on four publicly available benchmark datasets revealed that this proposed technique outperforms the classic Boruta algorithm, underscoring its potential for enhanced, accurate feature selection.
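The shadow-feature comparison at the heart of Boruta can be illustrated with a deliberately simplified, single-round sketch. Several things here are assumptions made for brevity: absolute Pearson correlation stands in for the random forest importance that Boruta normally uses, only one round is run with no statistical test across repetitions, and the `noise_std` parameter is just one hypothetical way to perturb the shadows, since the abstract does not spell out the paper's noise scheme.

```python
import random
from typing import Callable, List, Sequence


def abs_pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_round(
    columns: Sequence[Sequence[float]],
    target: Sequence[float],
    rng: random.Random,
    noise_std: float = 0.0,
    importance: Callable[[Sequence[float], Sequence[float]], float] = abs_pearson,
) -> List[int]:
    """One Boruta-style round: keep columns that beat the best shadow."""
    shadows = []
    for col in columns:
        shadow = list(col)
        rng.shuffle(shadow)  # permuting breaks any link to the target
        if noise_std > 0.0:
            # Hypothetical perturbation: add Gaussian noise to the shadow.
            shadow = [v + rng.gauss(0.0, noise_std) for v in shadow]
        shadows.append(shadow)
    threshold = max(importance(s, target) for s in shadows)
    return [
        i for i, col in enumerate(columns)
        if importance(col, target) > threshold
    ]


if __name__ == "__main__":
    data_rng = random.Random(0)
    signal = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
    noise = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]
    target = [2.0 * s for s in signal]
    print(shadow_round([signal] + noise, target, random.Random(1)))
```

A full Boruta run repeats this comparison many times and uses a statistical test on the hit counts before confirming or rejecting a feature; the single round above only shows where the shadow threshold comes from.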
Feature Selection Using Boruta
Feature selection is a crucial step in machine learning. In feature selection, we select the features that are relevant to our model. Features that give useful information about the data and improve the accuracy of the model are all a data scientist needs. Finding the relevant features is the tough job in end-to-end projects. There are many methods for feature selection.
Feature Importance -- How's and Why's
In this article, we will explore various feature selection techniques that you need to be familiar with in order to get the best performance out of your model. SelectKBest is a method provided by sklearn to rank the features of a dataset by their "importance" with respect to the target variable. This "importance" is calculated using a score function which can be one of the following: All of the above-mentioned scoring functions are based on statistics. For instance, the f_regression function arranges the p-values of the variables in increasing order and picks the K columns with the smallest p-values. Features with a p-value of less than 0.05 are considered "significant", and only these features should be used in the predictive model.
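The top-K idea behind SelectKBest with f_regression can be sketched without sklearn. The code below is an illustrative stand-in, not sklearn's implementation: `f_statistic` and `select_k_best` are hypothetical names. It relies on the fact that, for a single predictor, the regression F-statistic is F = r² / (1 − r²) · (n − 2), so ranking by the largest F is the same as ranking by the smallest p-value when all features share the same sample size.

```python
from typing import List, Sequence


def f_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Univariate regression F-statistic: r^2 / (1 - r^2) * (n - 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no linear signal
    r2 = cov * cov / (var_x * var_y)
    if r2 >= 1.0:
        return float("inf")  # perfect linear relationship
    return r2 / (1.0 - r2) * (n - 2)


def select_k_best(
    columns: Sequence[Sequence[float]],
    target: Sequence[float],
    k: int,
) -> List[int]:
    """Return the indices of the k columns with the highest F-statistic."""
    if not 0 <= k <= len(columns):
        raise ValueError("k must be between 0 and the number of columns")
    scores = [f_statistic(col, target) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    x0 = [1, 2, 3, 4, 5, 6]   # nearly proportional to the target
    x1 = [3, 1, 4, 1, 5, 9]   # weakly related
    x2 = [7, 7, 7, 7, 7, 7]   # constant
    y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
    print(select_k_best([x0, x1, x2], y, k=2))
```

Because each feature is scored on its own, this kind of filter cannot detect features that are only useful in combination, which is one motivation for wrapper methods such as Boruta.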
Sequential Feature Classification in the Context of Redundancies
Pfannschmidt, Lukas, Hammer, Barbara
The problem of all-relevant feature selection is concerned with finding a relevant feature set with preserved redundancies. There exist several approximations to solve this problem but only one could give a distinction between strong and weak relevance. This approach was limited to the case of linear problems. In this work, we present a new solution for this distinction in the non-linear case through the use of random forest models and statistical methods.
Feature Selection: Beyond feature importance? - KDnuggets
In machine learning, feature selection is the process of choosing the features that are most useful for your prediction. Although it sounds simple, it is one of the most complex problems in creating a new machine learning model. In this post, I will share some of the approaches that were researched during the last project I led at Fiverr. You will get some ideas on the basic method I tried and also on the more complex approach, which got the best results: removing over 60% of the features while maintaining accuracy and achieving more stability for our model. I'll also be sharing our improvement to this algorithm.
varrank: an R package for variable ranking based on mutual information with applications to observed systemic datasets
Kratzer, Gilles, Furrer, Reinhard
This article describes the R package varrank. It has a flexible implementation of heuristic approaches which perform variable ranking based on mutual information. The package is particularly suitable for exploring multivariate datasets requiring a holistic analysis. The core functionality is a general implementation of the minimum redundancy maximum relevance (mRMRe) model. This approach is based on information theory metrics. It is compatible with discrete and continuous data which are discretised using a large choice of possible rules. The two main problems that can be addressed by this package are the selection of the most representative variables for modeling a collection of variables of interest, i.e., dimension reduction, and variable ranking with respect to a set of variables of interest.
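The minimum redundancy maximum relevance idea can be sketched as a greedy loop over mutual information scores. This is a minimal Python illustration for discrete data and is not the varrank API: `mutual_information` and `mrmr_rank` are hypothetical names, and the score used here (relevance minus mean redundancy with the already selected variables) is only one common choice among several.

```python
from collections import Counter
from math import log
from typing import Hashable, List, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


def mrmr_rank(
    columns: Sequence[Sequence[Hashable]],
    target: Sequence[Hashable],
    k: int,
) -> List[int]:
    """Greedily pick k column indices by relevance minus mean redundancy."""
    relevance = [mutual_information(col, target) for col in columns]
    selected: List[int] = []
    remaining = list(range(len(columns)))

    def score(i: int) -> float:
        if not selected:
            return relevance[i]
        redundancy = sum(
            mutual_information(columns[i], columns[j]) for j in selected
        ) / len(selected)
        return relevance[i] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)  # first maximum wins on ties
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    y = [0, 1, 2, 3, 0, 1, 2, 3]
    high_bit = [0, 0, 1, 1, 0, 0, 1, 1]
    duplicate = list(high_bit)            # fully redundant with high_bit
    low_bit = [0, 1, 0, 1, 0, 1, 0, 1]    # complementary information
    print(mrmr_rank([high_bit, duplicate, low_bit], y, k=2))
```

In the toy data, the duplicated column is as relevant as the original but adds nothing new, so the redundancy penalty pushes the complementary column ahead of it.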
R Addict Blog
Feature selection is a process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine wandering. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta, and the entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages. I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison.