Goto

Collaborating Authors

 Decision Tree Learning


Imputation using training labels and classification via label imputation

arXiv.org Machine Learning

Missing data is a common problem in practical settings. Various imputation methods have been developed to deal with missing data. However, even though the label is usually available in the training data, the common practice of imputation usually only relies on the input and ignores the label. In this work, we illustrate how stacking the label into the input can significantly improve the imputation of the input. In addition, we propose a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation. This allows imputing the label and the input at the same time. Also, the technique is capable of handling data training with missing labels without any prior imputation and is applicable to continuous, categorical, or mixed-type data. Experiments show promising results in terms of accuracy.


Spatial and Temporal Characteristics of Freight Tours: A Data-Driven Exploratory Analysis

arXiv.org Artificial Intelligence

This paper presents a modeling approach to infer scheduling and routing patterns from digital freight transport activity data for different freight markets. We provide a complete modeling framework including a new discrete-continuous decision tree approach for extracting rules from the freight transport data. We apply these models to collected tour data for the Netherlands to understand departure time patterns and tour strategies, also allowing us to evaluate the effectiveness of the proposed algorithm. We find that spatial and temporal characteristics are important to capture the types of tours and time-of-day patterns of freight activities. Also, the empirical evidence indicates that carriers in most of the transport markets are sensitive to the level of congestion. Many of them adjust the type of tour, departure time, and the number of stops per tour when facing a congested zone. The results can be used by practitioners to get more grip on transport markets and develop freight and traffic management measures.


A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation

arXiv.org Machine Learning

Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and the evaluation of different combinations would enable a better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) are not only better in performance for selected CI problems, but also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.


Parallel Coordinates for Discovery of Interpretable Machine Learning Models

arXiv.org Artificial Intelligence

This work uses visual knowledge discovery in parallel coordinates to advance methods of interpretable machine learning. The graphic data representation in parallel coordinates made the concepts of hypercubes and hyperblocks (HBs) simple to understand for end users. It is suggested to use mixed and pure hyperblocks in the proposed data classifier algorithm Hyper. It is shown that Hyper models generalize decision trees. The algorithm is presented in several settings and options to discover interactively or automatically overlapping or non-overlapping hyperblocks. Additionally, the use of hyperblocks in conjunction with language descriptions of visual patterns is demonstrated. The benchmark data from the UCI ML repository were used to evaluate the Hyper algorithm. It enabled the discovery of mixed and pure HBs evaluated using 10-fold cross validation. Connections among hyperblocks, dimension reduction and visualization have been established. The capability of end users to find and observe hyperblocks, as well as the ability of side-by-side visualizations to make patterns evident, are among major advantages ofhyperblock technology and the Hyper algorithm. A new method to visualize incomplete n-D data with missing values is proposed, while the traditional parallel coordinates do not support it. The ability of HBs to better prevent both overgeneralization and overfitting of data over decision trees is demonstrated as another benefit of the hyperblocks. The features of VisCanvas 2.0 software tool that implements Hyper technology are presented.


Supervised Feature Compression based on Counterfactual Analysis

arXiv.org Artificial Intelligence

Counterfactual Explanations are becoming a de-facto standard in post-hoc interpretable machine learning. For a given classifier and an instance classified in an undesired class, its counterfactual explanation corresponds to small perturbations of that instance that allows changing the classification outcome. This work aims to leverage Counterfactual Explanations to detect the important decision boundaries of a pre-trained black-box model. This information is used to build a supervised discretization of the features in the dataset with a tunable granularity. Using the discretized dataset, an optimal Decision Tree can be trained that resembles the black-box model, but that is interpretable and compact. Numerical results on real-world datasets show the effectiveness of the approach in terms of accuracy and sparsity.


Example-Based Explanations of Random Forest Predictions

arXiv.org Artificial Intelligence

A random forest prediction can be computed by the scalar product of the labels of the training examples and a set of weights that are determined by the leafs of the forest into which the test object falls; each prediction can hence be explained exactly by the set of training examples for which the weights are non-zero. The number of examples used in such explanations is shown to vary with the dimensionality of the training set and hyperparameters of the random forest algorithm. This means that the number of examples involved in each prediction can to some extent be controlled by varying these parameters. However, for settings that lead to a required predictive performance, the number of examples involved in each prediction may be unreasonably large, preventing the user to grasp the explanations. In order to provide more useful explanations, a modified prediction procedure is proposed, which includes only the top-weighted examples. An investigation on regression and classification tasks shows that the number of examples used in each explanation can be substantially reduced while maintaining, or even improving, predictive performance compared to the standard prediction procedure.


Early Detection of Bark Beetle Attack Using Remote Sensing and Machine Learning: A Review

arXiv.org Artificial Intelligence

This paper provides a comprehensive review of past and current advances in the early detection of bark beetle-induced tree mortality from three primary perspectives: bark beetle & host interactions, RS, and ML/DL. In contrast to prior efforts, this review encompasses all RS systems and emphasizes ML/DL methods to investigate their strengths and weaknesses. We parse existing literature based on multi- or hyper-spectral analyses and distill their knowledge based on: bark beetle species & attack phases with a primary emphasis on early stages of attacks, host trees, study regions, RS platforms & sensors, spectral/spatial/temporal resolutions, spectral signatures, spectral vegetation indices (SVIs), ML approaches, learning schemes, task categories, models, algorithms, classes/clusters, features, and DL networks & architectures. Although DL-based methods and the random forest (RF) algorithm showed promising results, highlighting their potential to detect subtle changes across visible, thermal, and short-wave infrared (SWIR) spectral regions, they still have limited effectiveness and high uncertainties. To inspire novel solutions to these shortcomings, we delve into the principal challenges & opportunities from different perspectives, enabling a deeper understanding of the current state of research and guiding future research directions.


Continuous Authentication Using Mouse Clickstream Data Analysis

arXiv.org Artificial Intelligence

Biometrics is used to authenticate an individual based on physiological or behavioral traits. Mouse dynamics is an example of a behavioral biometric that can be used to perform continuous authentication as protection against security breaches. Recent research on mouse dynamics has shown promising results in identifying users; however, it has not yet reached an acceptable level of accuracy. In this paper, an empirical evaluation of different classification techniques is conducted on a mouse dynamics dataset, the Balabit Mouse Challenge dataset. User identification is carried out using three mouse actions: mouse move, point and click, and drag and drop. Verification and authentication methods are conducted using three machine-learning classifiers: the Decision Tree classifier, the K-Nearest Neighbors classifier, and the Random Forest classifier. The results show that the three classifiers can distinguish between a genuine user and an impostor with a relatively high degree of accuracy. In the verification mode, all the classifiers achieve a perfect accuracy of 100%. In authentication mode, all three classifiers achieved the highest accuracy (ACC) and Area Under Curve (AUC) from scenario B using the point and click action data: (Decision Tree ACC:87.6%,


Explainable artificial intelligence model for identifying Market Value in Professional Soccer Players

arXiv.org Artificial Intelligence

This study introduces an advanced machine learning method for predicting soccer players' market values, combining ensemble models and the Shapley Additive Explanations (SHAP) for interpretability. Utilizing data from about 12,000 players from Sofifa, the Boruta algorithm streamlined feature selection. The Gradient Boosting Decision Tree (GBDT) model excelled in predictive accuracy, with an R-squared of 0.901 and a Root Mean Squared Error (RMSE) of 3,221,632.175. Player attributes in skills, fitness, and cognitive areas significantly influenced market value. These insights aid sports industry stakeholders in player valuation. However, the study has limitations, like underestimating superstar players' values and needing larger datasets. Future research directions include enhancing the model's applicability and exploring value prediction in various contexts.


SecureCut: Federated Gradient Boosting Decision Trees with Efficient Machine Unlearning

arXiv.org Artificial Intelligence

In response to legislation mandating companies to honor the \textit{right to be forgotten} by erasing user data, it has become imperative to enable data removal in Vertical Federated Learning (VFL) where multiple parties provide private features for model training. In VFL, data removal, i.e., \textit{machine unlearning}, often requires removing specific features across all samples under privacy guarentee in federated learning. To address this challenge, we propose \methname, a novel Gradient Boosting Decision Tree (GBDT) framework that effectively enables both \textit{instance unlearning} and \textit{feature unlearning} without the need for retraining from scratch. Leveraging a robust GBDT structure, we enable effective data deletion while reducing degradation of model performance. Extensive experimental results on popular datasets demonstrate that our method achieves superior model utility and forgetfulness compared to \textit{state-of-the-art} methods. To our best knowledge, this is the first work that investigates machine unlearning in VFL scenarios.