Decision Tree Learning


Imbalanced Regression Pipeline Recommendation

arXiv.org Artificial Intelligence

Imbalanced problems are prevalent in various real-world scenarios and are extensively explored in classification tasks. However, they also present challenges for regression tasks due to the rarity of certain target values. A common approach is to employ balancing algorithms in preprocessing to address dataset imbalance. However, due to the variety of resampling methods and learning models, determining the optimal solution requires testing many combinations. Furthermore, the best strategy depends on the learning model, dataset, and evaluation metric. This work proposes the Meta-learning for Imbalanced Regression (Meta-IR) framework, which diverges from existing literature by training meta-classifiers to recommend the best pipeline composed of the resampling strategy and learning model per task in a zero-shot fashion. The meta-classifiers are trained on a set of meta-features and learn to map them to the classes indicating the best pipeline. We propose two formulations: Independent and Chained. Independent trains the meta-classifiers to separately indicate the best learning algorithm and resampling strategy. Chained involves a sequential procedure where the output of one meta-classifier is used as input for another to model intrinsic relationship factors. The Chained scenario showed superior performance, suggesting a relationship between the learning algorithm and the resampling strategy per task. Compared with AutoML frameworks, Meta-IR obtained better results. Moreover, compared with baselines of six learning algorithms and six resampling algorithms plus no resampling, totaling 42 (6 × 7) configurations, Meta-IR outperformed all of them. The code, data, and further information about the experiments can be found on GitHub: https://github.com/JusciAvelino/Meta-IR.
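The difference between the Independent and Chained formulations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy meta-dataset, the label names, the 1-nearest-neighbour meta-classifier, and the chain order (learning model first, resampler second) are hypothetical and are not taken from the Meta-IR code.

```python
from typing import List, Sequence, Tuple


def nearest_label(train_x: Sequence[Sequence[float]],
                  train_y: Sequence[str],
                  query: Sequence[float]) -> str:
    """Predict with 1-nearest-neighbour under squared Euclidean distance."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


def one_hot(label: str, labels: Sequence[str]) -> List[float]:
    """Encode a label as a 0/1 vector over the known labels."""
    return [1.0 if label == known else 0.0 for known in labels]


# Hypothetical meta-dataset: one row of meta-features per past task,
# plus the learning model and resampler that worked best on that task.
META_X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
BEST_MODEL = ["tree", "tree", "linear", "linear"]
BEST_RESAMPLER = ["oversample", "oversample", "none", "none"]


def recommend_independent(query: List[float]) -> Tuple[str, str]:
    """Two meta-classifiers that never see each other's output."""
    return (nearest_label(META_X, BEST_MODEL, query),
            nearest_label(META_X, BEST_RESAMPLER, query))


def recommend_chained(query: List[float]) -> Tuple[str, str]:
    """The second meta-classifier receives the first one's output as an extra input."""
    models = sorted(set(BEST_MODEL))
    model = nearest_label(META_X, BEST_MODEL, query)
    # Training rows are augmented with the true model label,
    # the query with the predicted one.
    augmented = [row + one_hot(label, models)
                 for row, label in zip(META_X, BEST_MODEL)]
    resampler = nearest_label(augmented, BEST_RESAMPLER,
                              query + one_hot(model, models))
    return model, resampler


print(recommend_independent([0.12, 0.9]))
print(recommend_chained([0.12, 0.9]))
```

On this toy data both formulations agree. The chained variant differs only in that the resampler recommendation can depend on the recommended learning model.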


Trustworthy Tree-based Machine Learning by $MoS_2$ Flash-based Analog CAM with Inherent Soft Boundaries

arXiv.org Artificial Intelligence

The rapid advancement of artificial intelligence has raised concerns regarding its trustworthiness, especially in terms of interpretability and robustness. Tree-based models like Random Forest and XGBoost excel in interpretability and accuracy for tabular data, but scaling them remains computationally expensive due to poor data locality and high data dependence. Previous efforts to accelerate these models with analog content addressable memory (CAM) have struggled because the sharp decision boundaries they require are difficult to implement and highly susceptible to device variations, which leads to poor hardware performance and vulnerability to adversarial attacks. This work presents a novel hardware-software co-design approach using $MoS_2$ Flash-based analog CAM with inherent soft boundaries, enabling efficient inference with soft tree-based models. Our soft tree model inference experiments on $MoS_2$ analog CAM arrays show that this method achieves exceptional robustness against device variation and adversarial attacks while achieving state-of-the-art accuracy. Specifically, our fabricated analog CAM arrays achieve $96\%$ accuracy on the Wisconsin Diagnostic Breast Cancer (WDBC) database while maintaining decision explainability. Our experimentally calibrated model showed only a $0.6\%$ accuracy drop on the MNIST dataset under $10\%$ device threshold variation, compared to a $45.3\%$ drop for traditional decision trees. This work paves the way for specialized hardware that enhances AI's trustworthiness and efficiency.
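To make the contrast between sharp and soft decision boundaries concrete, the following sketch compares a step-function split with a sigmoid split when the threshold drifts. The sigmoid form, the `sharpness` parameter, and the numbers are assumptions chosen for illustration. The abstract does not specify the shape of the soft boundary realized by the analog CAM.

```python
import math


def hard_split(x: float, threshold: float) -> float:
    """Sharp boundary: route fully right (1.0) or fully left (0.0)."""
    return 1.0 if x > threshold else 0.0


def soft_split(x: float, threshold: float, sharpness: float = 10.0) -> float:
    """Soft boundary (assumed sigmoid): probability of routing right."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))


def output_shift(split, x: float, threshold: float, drift: float) -> float:
    """How much a node's output changes when its threshold drifts."""
    return abs(split(x, threshold + drift) - split(x, threshold))


x, threshold, drift = 0.52, 0.50, 0.05
print("hard:", output_shift(hard_split, x, threshold, drift))
print("soft:", output_shift(soft_split, x, threshold, drift))
```

For an input close to the boundary, the hard split flips completely under the drift, while the soft split changes by a small amount. This only illustrates the intuition for a single node, not the behaviour of a full tree ensemble on hardware.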


On the Learning with Augmented Class via Forests

arXiv.org Artificial Intelligence

Decision trees and forests have been successful in many real applications, most of which assume that all testing classes are known in the training data. In this work, we focus on learning with augmented class via forests, where an augmented class may appear in the testing data yet not in the training data. We incorporate information about the augmented class into tree splitting: we introduce the augmented Gini impurity, a new splitting criterion that exploits unlabeled data from the testing distribution. We then develop the Learning with Augmented Class via Forests (LACForest) approach, which constructs shallow forests according to the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests via an optimization objective based on our augmented Gini impurity, which essentially utilizes the representation power of neural networks for forests. Theoretically, we present a convergence analysis for our augmented Gini impurity, and we finally conduct experiments to evaluate our approaches.
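For reference, the standard Gini impurity that the augmented criterion builds on can be computed as below. This sketch covers only the classical criterion. The augmented Gini impurity is not defined in the abstract, so it is deliberately not reproduced here, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Standard Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Size-weighted impurity of a candidate split (assumes a non-empty node)."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / total


print(gini_impurity(["a", "a", "b", "b"]))
print(split_impurity(["a", "a"], ["b", "b"]))
```

A tree learner picks the split with the lowest weighted impurity. A perfectly separating split, as in the second call, has impurity zero.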


On the Efficiency of Training Robust Decision Trees

arXiv.org Artificial Intelligence

As machine learning is rapidly adopted in industry, trustworthiness is increasingly in focus. Yet, the efficiency and sustainability of robust training pipelines still have to be established. In this work, we consider a simple pipeline for training adversarially robust decision trees and investigate the efficiency of each step. Our pipeline consists of three stages. Firstly, we choose the perturbation size automatically for each dataset. For that, we introduce a simple algorithm, instead of relying on intuition or prior work. Moreover, we show that the perturbation size can be estimated from smaller models than the one intended for full training, and thus significant gains in efficiency can be achieved. Secondly, we train models with state-of-the-art adversarial training methods and evaluate them regarding both their training time and adversarial accuracy. Thirdly, we certify the robustness of each of the models thus obtained and investigate the time required for this. We find that verification time, which is critical to the efficiency of the full pipeline, is not correlated with training time.
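The role of the perturbation size in adversarial accuracy can be shown on the smallest possible tree, a one-feature decision stump. This is a toy sketch under stated assumptions: the stump, the data, and the function names are hypothetical, and it does not reproduce the paper's algorithm for choosing the perturbation size or its verification tools.

```python
from typing import Sequence


def stump_predict(x: float, threshold: float) -> int:
    """One-feature decision stump: class 1 to the right of the threshold."""
    return 1 if x > threshold else 0


def robust_accuracy(xs: Sequence[float], ys: Sequence[int],
                    threshold: float, epsilon: float) -> float:
    """Fraction of points classified correctly under every shift in [-epsilon, epsilon].

    The stump is monotone in x, so checking both interval endpoints is exact.
    """
    robust = 0
    for x, y in zip(xs, ys):
        if (stump_predict(x - epsilon, threshold) == y
                and stump_predict(x + epsilon, threshold) == y):
            robust += 1
    return robust / len(xs)


xs = [0.1, 0.45, 0.55, 0.9]
ys = [0, 0, 1, 1]
for epsilon in (0.0, 0.1):
    print(epsilon, robust_accuracy(xs, ys, threshold=0.5, epsilon=epsilon))
```

With `epsilon = 0.0` this reduces to ordinary accuracy. A larger `epsilon` removes the points near the threshold from the robust set, which is why the choice of perturbation size matters for every later stage.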


TRIP: A Nonparametric Test to Diagnose Biased Feature Importance Scores

arXiv.org Machine Learning

Along with accurate prediction, understanding the contribution of each feature to the making of the prediction, i.e., the importance of the feature, is a desirable and arguably necessary component of a machine learning model. For a complex model such as a random forest, such importances are not innate, as they are, e.g., with linear regression. Efficient methods have been created to provide such capabilities, with one of the most popular among them being permutation feature importance due to its efficiency, model-agnostic nature, and perceived intuitiveness. However, permutation feature importance has been shown to be misleading in the presence of dependent features as a result of the creation of unrealistic observations when permuting the dependent features. In this work, we develop TRIP (Test for Reliable Interpretation via Permutation), a test requiring minimal assumptions that is able to detect unreliable permutation feature importance scores that are the result of model extrapolation. To build on this, we demonstrate how the test can be complemented in order to allow its use in high dimensional settings. Through testing on simulated data and applications, our results show that the test can be used to reliably detect when permutation feature importance scores are unreliable.
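Permutation feature importance itself is simple to state in code, which also makes the dependent-feature problem visible. The sketch below is a generic illustration, not the TRIP test: the model, data, and function names are assumptions.

```python
import random
from typing import Callable, List, Sequence


def mse(model: Callable[[List[float]], float],
        rows: Sequence[List[float]], targets: Sequence[float]) -> float:
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model: Callable[[List[float]], float],
                           rows: Sequence[List[float]],
                           targets: Sequence[float],
                           feature: int, seed: int = 0) -> float:
    """Increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + [v] + r[feature + 1:]
                for r, v in zip(rows, column)]
    return mse(model, permuted, targets) - mse(model, rows, targets)


# Feature 1 is an exact copy of feature 0, so the two are perfectly dependent.
rows = [[float(i), float(i)] for i in range(20)]
targets = [2.0 * r[0] for r in rows]


def model(row: List[float]) -> float:
    """Toy model that happens to use only feature 0."""
    return 2.0 * row[0]


print(permutation_importance(model, rows, targets, feature=0))
print(permutation_importance(model, rows, targets, feature=1))
```

Shuffling either column produces rows whose two entries differ, which never happens in the original data. The model is therefore scored on observations outside the data distribution, the situation the abstract describes as model extrapolation.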


Leo Breiman, the Rashomon Effect, and the Occam Dilemma

arXiv.org Machine Learning

In the famous Two Cultures paper, Leo Breiman provided a visionary perspective on the cultures of "data models" (modeling with consideration of data generation) versus "algorithmic models" (vanilla machine learning models). I provide a modern perspective on these approaches. One of Breiman's key arguments against data models is the "Rashomon Effect," which is the existence of many different-but-equally-good models. The Rashomon Effect implies that data modelers would not be able to determine which model generated the data. Conversely, one of the core advantages he cites in favor of data models is simplicity, as he claimed there exists an "Occam Dilemma," i.e., an accuracy-simplicity tradeoff. After 25 years of powerful computers, it has become clear that this claim is not generally true, in that algorithmic models do not need to be complex to be accurate; however, there are nuances that help explain Breiman's logic, specifically, that by "simple," he appears to consider only linear models or unoptimized decision trees. Interestingly, the Rashomon Effect is a key tool in proving the nullification of the Occam Dilemma. To his credit though, Breiman did not have the benefit of modern computers, with which my observations are much easier to make. Breiman's goal for interpretability was somewhat intertwined with causality: simpler models can help reveal which variables have a causal relationship with the outcome. However, I argue that causality can be investigated without the use of single models, whether or not they are simple. Interpretability is useful in its own right, and I think Breiman knew that too. Technically, my modern perspective does not belong to either of Breiman's Two Cultures, but shares the goals of both of them (causality, simplicity, accuracy) and shows that these goals can be accomplished in other ways, without the limitations Breiman was concerned about.


Diversity Conscious Refined Random Forest

arXiv.org Artificial Intelligence

Random Forest (RF) is a widely used ensemble learning technique known for its robust classification performance across diverse domains. However, it often relies on hundreds of trees and all input features, leading to high inference cost and model redundancy. In this work, our goal is to grow trees dynamically only on informative features and then enforce maximal diversity by clustering and retaining uncorrelated trees. Therefore, we propose a Refined Random Forest Classifier that iteratively refines itself by first removing the least informative features and then analytically determining how many new trees should be grown, followed by correlation-based clustering to remove redundant trees. The classification accuracy of our model was compared against the standard RF with the same number of trees. Experiments on eight benchmark datasets, including binary and multiclass datasets, demonstrate that the proposed model achieves improved accuracy compared to standard RF.
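The idea of removing redundant trees by correlation can be sketched as follows. Note the substitution: the abstract describes correlation-based clustering, whereas this sketch uses a simpler greedy filter on pairwise Pearson correlation of the trees' predictions. The threshold `max_corr` and all names are assumptions.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; defined as 0.0 when either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def keep_diverse_trees(predictions: Sequence[Sequence[float]],
                       max_corr: float = 0.9) -> List[int]:
    """Greedily keep a tree only if it is not too correlated with any kept tree."""
    kept: List[int] = []
    for index, candidate in enumerate(predictions):
        if all(pearson(candidate, predictions[k]) <= max_corr for k in kept):
            kept.append(index)
    return kept


# One row per tree: its predictions on a shared validation set.
tree_predictions = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],  # duplicate of the first tree
    [1, 1, 0, 0],
]
print(keep_diverse_trees(tree_predictions))
```

The greedy filter depends on tree order, which a clustering-based approach avoids. It is shown here only because it fits in a few lines.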


Predicting and Explaining Customer Data Sharing in the Open Banking

arXiv.org Artificial Intelligence

The emergence of Open Banking represents a significant shift in financial data management, influencing financial institutions' market dynamics and marketing strategies. This increased competition creates opportunities and challenges, as institutions manage data inflow to improve products and services while mitigating data outflow that could aid competitors. This study introduces a framework to predict customers' propensity to share data via Open Banking and interprets this behavior through Explanatory Model Analysis (EMA). Using data from a large Brazilian financial institution with approximately 3.2 million customers, a hybrid data balancing strategy incorporating ADASYN and NEARMISS techniques was employed to address the infrequency of data sharing and enhance the training of XGBoost models. These models accurately predicted customer data sharing, achieving 91.39% accuracy for inflow and 91.53% for outflow. The EMA phase combined the Shapley Additive Explanations (SHAP) method with the Classification and Regression Tree (CART) technique, revealing the most influential features on customer decisions. Key features included the number of transactions and purchases in mobile channels, interactions within these channels, and credit-related features, particularly credit card usage across the national banking system. These results highlight the critical role of mobile engagement and credit in driving customer data-sharing behaviors, providing financial institutions with strategic insights to enhance competitiveness and innovation in the Open Banking environment.


Targeted tuning of random forests for quantile estimation and prediction intervals

arXiv.org Machine Learning

We present a novel tuning procedure for random forests (RFs) that improves the accuracy of estimated quantiles and produces valid, relatively narrow prediction intervals. While RFs are typically used to estimate mean responses (conditional on covariates), they can also be used to estimate quantiles by estimating the full distribution of the response. However, standard approaches for building RFs often result in excessively biased quantile estimates. To reduce this bias, our proposed tuning procedure minimizes "quantile coverage loss" (QCL), which we define as the estimated bias of the marginal quantile coverage probability estimate based on the out-of-bag sample. We adapt QCL tuning to handle censored data and demonstrate its use with random survival forests. We show that QCL tuning results in quantile estimates with more accurate coverage probabilities than those achieved using default parameter values or traditional tuning (using MSPE for uncensored data and C-index for censored data), while also reducing the estimated MSE of these coverage probabilities. We discuss how the superior performance of QCL tuning is linked to its alignment with the estimation goal. Finally, we explore the validity and width of prediction intervals created using this method.
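The quantity behind coverage-based tuning can be illustrated with a short sketch: compare the empirical coverage of a set of quantile estimates with the nominal level. This is a simplified stand-in, not the paper's QCL definition. The absolute-gap form, the function names, and the toy data are assumptions, and the out-of-bag mechanics are omitted.

```python
from typing import Sequence


def empirical_coverage(responses: Sequence[float],
                       quantile_estimates: Sequence[float]) -> float:
    """Fraction of responses at or below their estimated quantile."""
    hits = sum(1 for y, q in zip(responses, quantile_estimates) if y <= q)
    return hits / len(responses)


def coverage_gap(responses: Sequence[float],
                 quantile_estimates: Sequence[float], tau: float) -> float:
    """Absolute difference between empirical and nominal coverage."""
    return abs(empirical_coverage(responses, quantile_estimates) - tau)


# Held-out responses and a model's estimate of the tau-quantile for each case.
responses = [1.0, 2.0, 3.0, 4.0]
estimates = [2.5, 2.5, 2.5, 2.5]
print(coverage_gap(responses, estimates, tau=0.5))
print(coverage_gap(responses, estimates, tau=0.9))
```

A tuning loop in this spirit would evaluate the gap for each candidate hyperparameter setting and keep the setting with the smallest gap, so that the tuning criterion matches the estimation goal.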


LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification

arXiv.org Artificial Intelligence

The classification of imbalanced data streams, which have unequal class distributions, is a key difficulty in machine learning, especially when dealing with multiple classes. While binary imbalanced data stream classification tasks have received considerable attention, only a few studies have focused on multi-class imbalanced data streams. Effectively managing the dynamic imbalance ratio is a key challenge in this domain. This study introduces a novel, robust, and resilient approach to address these challenges by integrating Locality Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic Ensemble Diversification (DynED) framework. To the best of our knowledge, we present the first application of LSH-RHP for undersampling in the context of imbalanced non-stationary data streams. The proposed method undersamples the majority classes by utilizing LSH-RHP, provides a balanced training set, and improves the ensemble's prediction performance. We conduct comprehensive experiments on 23 real-world and ten semi-synthetic datasets and compare LSH-DynED with 15 state-of-the-art methods. The results reveal that LSH-DynED outperforms other approaches in terms of both Kappa and mG-Mean effectiveness measures, demonstrating its capability in dealing with multi-class imbalanced non-stationary data streams. Notably, LSH-DynED performs well in large-scale, high-dimensional datasets with considerable class imbalances and demonstrates adaptation and robustness in real-world circumstances. To motivate our design, we review existing methods for imbalanced data streams, outline key challenges, and offer guidance for future work. For the reproducibility of our results, we have made our implementation available on GitHub.
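Locality-sensitive hashing with random hyperplane projections can be sketched in a few lines: each point is hashed to the sign pattern of its dot products with random hyperplanes, and similar directions tend to share a bucket. The undersampling rule below (keep one representative per bucket) is an assumption made for illustration. The abstract does not specify how LSH-DynED selects samples from the buckets, and all names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_hyperplanes(n_planes: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(point: Sequence[float],
              planes: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Hash a point to one bit per hyperplane: which side of the plane it lies on."""
    return tuple(1 if sum(w * x for w, x in zip(plane, point)) >= 0 else 0
                 for plane in planes)


def undersample(points: Sequence[List[float]],
                planes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Keep the first point of each bucket (assumed rule, for illustration)."""
    buckets: Dict[Tuple[int, ...], List[List[float]]] = defaultdict(list)
    for point in points:
        buckets[signature(point, planes)].append(point)
    return [members[0] for members in buckets.values()]


planes = make_hyperplanes(n_planes=4, dim=2)
majority = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [-1.0, -2.0]]
print(undersample(majority, planes))
```

Points that lie in the same direction, such as the first three above, always receive the same signature, so the bucket collapses to a single representative while points in other regions survive.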