nominal feature
RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data
Matjašec, Urška, Simidjievski, Nikola, Jamnik, Mateja
Tree-based models are often robust to uninformative features and can accurately capture non-smooth, complex decision boundaries. Consequently, they often outperform neural network-based models on tabular datasets at a significantly lower computational cost. Nevertheless, the capability of traditional tree-based ensembles to express complex relationships efficiently is limited by using a single feature to make splits. To improve the efficiency and expressiveness of tree-based methods, we propose Random Oblique Fast Interpretable Greedy-Tree Sums (RO-FIGS). RO-FIGS builds on Fast Interpretable Greedy-Tree Sums, and extends it by learning trees with oblique or multivariate splits, where each split consists of a linear combination learnt from random subsets of features. This helps uncover interactions between features and improves performance. The proposed method is suitable for tabular datasets with both numerical and categorical features. We evaluate RO-FIGS on 22 real-world tabular datasets, demonstrating superior performance and much smaller models over other tree- and neural network-based methods. Additionally, we analyse their splits to reveal valuable insights into feature interactions, enriching the information learnt from SHAP summary plots, and thereby demonstrating the enhanced interpretability of RO-FIGS models. The proposed method is well-suited for applications, where balance between accuracy and interpretability is essential.
R2VF: A Two-Step Regularization Algorithm to Cluster Categories in GLMs
Over recent decades, extensive research has aimed to overcome the restrictive underlying assumptions required for a Generalized Linear Model to generate accurate and meaningful predictions. These efforts include regularizing coefficients, selecting features, and clustering ordinal categories, among other approaches. Despite these advances, efficiently clustering nominal categories in GLMs without incurring high computational costs remains a challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step method designed to efficiently fuse nominal and ordinal categories in GLMs. By first transforming nominal features into an ordinal framework via regularized regression and then applying variable fusion, R2VF strikes a balance between model complexity and interpretability. We demonstrate the effectiveness of R2VF through comparisons with other methods, highlighting its performance in addressing overfitting and finding a proper set of covariates.
In to Decision Trees Part: 2. Hi! Hello and thanks for reading this…
If you missed the previous part of this blog "In to Decision Trees Part: 1", please visit it. In this blog, we will explore more about Decision Trees Algorithm and its capability. The Process of Decision trees on Numerical (Discrete/Continuous) features is slightly different than categorical features. While the Decision trees can handle categorical variables with ease. There are two types of categorical values.
Generalization Properties of Decision Trees on Real-valued and Categorical Features
Leboeuf, Jean-Samuel, LeBlanc, Frédéric, Marchand, Mario
We revisit binary decision trees from the perspective of partitions of the data. We introduce the notion of partitioning function, and we relate it to the growth function and to the VC dimension. We consider three types of features: real-valued, categorical ordinal and categorical nominal, with different split rules for each. For each feature type, we upper bound the partitioning function of the class of decision stumps before extending the bounds to the class of general decision tree (of any fixed structure) using a recursive approach. Using these new results, we are able to find the exact VC dimension of decision stumps on examples of $\ell$ real-valued features, which is given by the largest integer $d$ such that $2\ell \ge \binom{d}{\lfloor\frac{d}{2}\rfloor}$. Furthermore, we show that the VC dimension of a binary tree structure with $L_T$ leaves on examples of $\ell$ real-valued features is in $O(L_T \log(L_T\ell))$. Finally, we elaborate a pruning algorithm based on these results that performs better than the cost-complexity and reduced-error pruning algorithms on a number of data sets, with the advantage that no cross-validation is required.
SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features
Mukherjee, Mimi, Khushi, Matloob
Real world datasets are heavily skewed where some classes are significantly outnumbered by the other classes. In these situations, machine learning algorithms fail to achieve substantial efficacy while predicting these under-represented instances. To solve this problem, many variations of synthetic minority over-sampling methods (SMOTE) have been proposed to balance the dataset which deals with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based over-sampling technique to balance the data. In this paper, we present a novel minority over-sampling method, SMOTE-ENC (SMOTE - Encoded Nominal and Continuous), in which, nominal features are encoded as numeric values and the difference between two such numeric value reflects the amount of change of association with minority class. Our experiments show that the classification model using SMOTE-ENC method offers better prediction than model using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addressed one of the major limitations of SMOTE-NC algorithm. SMOTE-NC can be applied only on mixed datasets that have features consisting of both continuous and nominal features and cannot function if all the features of the dataset are nominal. Our novel method has been generalized to be applied on both mixed datasets and on nominal only datasets. The code is available from mkhushi.github.io
Network Intrusion Detection based on LSTM and Feature Embedding
Gwon, Hyeokmin, Lee, Chungjun, Keum, Rakun, Choi, Heeyoul
Growing number of network devices and services have led to increasing demand for protective measures as hackers launch attacks to paralyze or steal information from victim systems. Intrusion Detection System (IDS) is one of the essential elements of network perimeter security which detects the attacks by inspecting network traffic packets or operating system logs. While existing works demonstrated effectiveness of various machine learning techniques, only few of them utilized the time-series information of network traffic data. Also, categorical information has not been included in neural network based approaches. In this paper, we propose network intrusion detection models based on sequential information using long short-term memory (LSTM) network and categorical information using the embedding technique. We have experimented the models with UNSW-NB15, which is a comprehensive network traffic dataset. The experiment results confirm that the proposed method improve the performance, observing binary classification accuracy of 99.72\%.
Why You Should Build XGBoost Models Within H2O - Sefik Ilkin Serengil
XGBoost triggered the rise of the tree based models in the machine learning world. It earns reputation with its robust models. Its built models mostly get almost 2% more accuracy. On the other hand, it is a fact that XGBoost is almost 10 times slower than LightGBM. Speed means a lot in a data challenge.
Introduction to Machine Learning - CodeProject
"Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed" Arthur Smauel. We can think of machine learning as approach to automate tasks like predictions or modelling. For example, consider an email spam filter system, instead of having programmers manually looking at the emails and coming up with spam rules. We can use a machine learning algorithm and feed it input data (emails) and it will automatically discover rules that are powerful enough to distinguish spam emails. Machine learning is used in many application nowadays like spam detection in emails or movie recommendation systems that tells you movies that you might like based on your viewing history.
A Hierarchical Multi-Output Nearest Neighbor Model for Multi-Output Dependence Learning
Morris, Richard G., Martinez, Tony, Smith, Michael R.
Multi-Output Dependence (MOD) learning is a generalization of standard classification problems that allows for multiple outputs that are dependent on each other. A primary issue that arises in the context of MOD learning is that for any given input pattern there can be multiple correct output patterns. This changes the learning task from function approximation to relation approximation. Previous algorithms do not consider this problem, and thus cannot be readily applied to MOD problems. To perform MOD learning, we introduce the Hierarchical Multi-Output Nearest Neighbor model (HMONN) that employs a basic learning model for each output and a modified nearest neighbor approach to refine the initial results.