Decision Tree Learning
Robust COVID-19 Detection from Cough Sounds using Deep Neural Decision Tree and Forest: A Comprehensive Cross-Datasets Evaluation
Islam, Rofiqul, Chowdhury, Nihad Karim, Kabir, Muhammad Ashad
This research presents a robust approach to classifying COVID-19 cough sounds using cutting-edge machine-learning techniques. Leveraging deep neural decision trees and deep neural decision forests, our methodology demonstrates consistent performance across diverse cough sound datasets. We begin with a comprehensive extraction of features to capture a wide range of audio features from individuals, whether COVID-19 positive or negative. To determine the most important features, we use recursive feature elimination along with cross-validation. Bayesian optimization fine-tunes hyper-parameters of deep neural decision tree and deep neural decision forest models. Additionally, we integrate SMOTE during training to ensure a balanced representation of positive and negative data. Model performance refinement is achieved through threshold optimization, maximizing the ROC-AUC score. Our approach undergoes a comprehensive evaluation on five datasets: Cambridge, Coswara, COUGHVID, Virufy, and the combined Virufy with the NoCoCoDa dataset. Consistently outperforming state-of-the-art methods, our proposed approach yields notable AUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective datasets. Merging all datasets into a combined dataset, our method, using a deep neural decision forest classifier, achieves an AUC of 0.97. Also, our study includes a comprehensive cross-datasets analysis, revealing demographic and geographic differences in the cough sounds associated with COVID-19. These differences highlight the challenges in transferring learned features across diverse datasets and underscore the potential benefits of dataset integration, improving generalizability and enhancing COVID-19 detection from audio signals.
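The class-balancing step mentioned in the abstract above can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. This is a pure-Python illustration under assumed toy data, not the authors' pipeline; the function name `smote`, the parameter `k`, and the feature vectors are hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy 2-D feature vectors standing in for extracted audio features (made up).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority points, it stays inside the region the minority class already occupies. In practice oversampling is applied only to the training split, which matches the abstract's statement that SMOTE is integrated during training.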
Random Forest Regression Feature Importance for Climate Impact Pathway Detection
Brown, Meredith G. L., Peterson, Matt, Tezaur, Irina, Peterson, Kara, Bull, Diana
Disturbances to the climate system, both natural and anthropogenic, have far reaching impacts that are not always easy to identify or quantify using traditional climate science analyses or causal modeling techniques. In this paper, we develop a novel technique for discovering and ranking the chain of spatio-temporal downstream impacts of a climate source, referred to herein as a source-impact pathway, using Random Forest Regression (RFR) and SHapley Additive exPlanation (SHAP) feature importances. Rather than utilizing RFR for classification or regression tasks (the most common use case for RFR), we propose a fundamentally new workflow in which we: (i) train random forest (RF) regressors on a set of spatio-temporal features of interest, (ii) calculate their pair-wise feature importances using the SHAP weights associated with those features, and (iii) translate these feature importances into a weighted pathway network (i.e., a weighted directed graph), which can be used to trace out and rank interdependencies between climate features and/or modalities. Importantly, while herein we employ RFR and SHAP feature importance in steps (i) and (ii) of our algorithm, our novel workflow is in no way tied to these approaches, which could be replaced with any regression method and sensitivity method. We adopt a tiered verification approach to verify our new pathway identification methodology. In this approach, we apply our method to ensembles of data generated by running two increasingly complex benchmarks: (i) a set of synthetic coupled equations, and (ii) a fully coupled simulation of the 1991 eruption of Mount Pinatubo in the Philippines performed using a modified version 2 of the U.S. Department of Energy's Energy Exascale Earth System Model (E3SMv2). We find that our RFR feature importance-based approach can accurately detect known pathways of impact for both test cases.
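Step (iii) of the workflow described above, turning pair-wise feature importances into a weighted directed graph and tracing pathways from a source, might look roughly like the following sketch. Everything here is an assumption made for illustration: the importance values are invented, the variable names are hypothetical, the cut-off `threshold` is arbitrary, and scoring a pathway by the product of its edge weights is one simple choice that the abstract does not specify.

```python
def build_pathway_graph(importance, threshold=0.1):
    """importance[target][feature] becomes an edge feature -> target."""
    graph = {}
    for target, features in importance.items():
        for feature, weight in features.items():
            if feature != target and weight >= threshold:
                graph.setdefault(feature, {})[target] = weight
    return graph

def rank_pathways(graph, source, max_edges=3):
    """List simple paths starting at source, scored by the product of edge weights."""
    results = []

    def walk(node, path, score):
        for nxt, weight in graph.get(node, {}).items():
            if nxt in path:  # keep paths simple (no repeated nodes)
                continue
            new_path = path + [nxt]
            results.append((score * weight, new_path))
            if len(new_path) - 1 < max_edges:
                walk(nxt, new_path, score * weight)

    walk(source, [source], 1.0)
    return sorted(results, key=lambda r: r[0], reverse=True)

# Hypothetical importances: how much each feature matters when predicting a target.
importance = {
    "radiation": {"aerosol": 0.8, "temperature": 0.05},
    "temperature": {"radiation": 0.6, "aerosol": 0.2},
}
graph = build_pathway_graph(importance)
ranked = rank_pathways(graph, "aerosol")
for score, path in ranked:
    print(round(score, 2), " -> ".join(path))
```

In a real application the `importance` table would come from the trained regressors and their SHAP values, one regressor per target feature.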
Projected random forests and conformal prediction of circular data
F., Paulo C. Marques, Artes, Rinaldo, Graziadei, Helton
We apply split conformal prediction techniques to regression problems with circular responses by introducing a suitable conformity score, leading to prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under exchangeable data. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear response regression model into one suitable for circular responses. When random forests serve as basis models in this projection procedure, we harness the out-of-bag dynamics to eliminate the necessity for a separate calibration sample in the construction of prediction sets. For synthetic and real datasets the resulting projected random forests model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, when compared to the split conformal prediction sets generated by two existing alternative models.
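To make the split conformal construction concrete, here is a minimal sketch for angular responses. It assumes the conformity score is the shortest arc distance between the observed and predicted angle, which yields arcs of constant length; the abstract describes a score that gives adaptive arc lengths, so this is a simplified stand-in and not the authors' score. The calibration values are invented.

```python
import math

def arc_distance(a, b):
    """Shortest angular distance between two angles in radians, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def conformal_half_width(cal_true, cal_pred, alpha=0.1):
    """Split conformal quantile of the calibration scores."""
    scores = sorted(arc_distance(y, p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.pi  # too few calibration points: the set is the whole circle
    return scores[rank - 1]

def prediction_arc(pred, half_width):
    """Arc from start to end (counter-clockwise), both reduced to [0, 2*pi)."""
    two_pi = 2 * math.pi
    return ((pred - half_width) % two_pi, (pred + half_width) % two_pi)

# Made-up calibration angles (radians): true responses and model predictions.
cal_true = [0.10, 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 0.50, 1.50]
cal_pred = [6.20, 1.10, 1.80, 3.05, 4.30, 4.90, 6.10, 0.45, 1.65]
half_width = conformal_half_width(cal_true, cal_pred, alpha=0.2)
arc = prediction_arc(0.05, half_width)
print(half_width, arc)
```

The first calibration pair (0.10 versus 6.20) shows why a circular score is needed: the two angles are close on the circle even though their numeric difference is large. The resulting arc for a prediction near zero wraps through the origin, so its start is numerically larger than its end.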
A Semi-supervised CART Model for Covariate Shift
Cai, Mingyang, Klausch, Thomas, van de Wiel, Mark A.
Machine learning models used in medical applications often face challenges due to covariate shift, which occurs when there are discrepancies between the distributions of training and target data. This can lead to decreased predictive accuracy, especially with unknown outcomes in the target data. This paper introduces a semi-supervised classification and regression tree (CART) that uses importance weighting to address these distribution discrepancies. Our method improves the predictive performance of the CART model by assigning greater weights to training samples that more accurately represent the target distribution, especially in cases of covariate shift without target outcomes. In addition to CART, we extend this weighted approach to generalized linear model trees and tree ensembles, creating a versatile framework for managing covariate shift in complex datasets. Through simulation studies and applications to real-world medical data, we demonstrate significant improvements in predictive accuracy. These findings suggest that our weighted approach can enhance reliability in medical applications and other fields where covariate shift poses challenges to model performance across various data distributions.
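The importance-weighting idea can be sketched as follows: each training sample receives a weight approximating the ratio of target to training density at its covariate value, and the tree's impurity is computed from weighted class masses. The histogram density-ratio estimate, the bin edges, and the toy data below are assumptions for illustration only; the abstract does not say how the authors estimate the weights.

```python
from collections import Counter

def bin_index(x, edges):
    """Histogram bin of x; values outside the range are clamped to the end bins."""
    for i in range(1, len(edges) - 1):
        if x < edges[i]:
            return i - 1
    return len(edges) - 2

def importance_weights(train_x, target_x, edges):
    """Weight for each training point: target bin frequency / training bin frequency."""
    train_counts = Counter(bin_index(x, edges) for x in train_x)
    target_counts = Counter(bin_index(x, edges) for x in target_x)
    weights = []
    for x in train_x:
        b = bin_index(x, edges)
        p_train = train_counts[b] / len(train_x)
        p_target = target_counts[b] / len(target_x)
        weights.append(p_target / p_train)
    return weights

def weighted_gini(labels, weights):
    """Gini impurity computed from weighted class masses."""
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = {}
    for y, w in zip(labels, weights):
        mass[y] = mass.get(y, 0.0) + w
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

# Toy data: training covariates concentrate below 0.5, target covariates above.
train_x = [0.1, 0.2, 0.3, 0.7]
train_y = [0, 0, 1, 1]
target_x = [0.6, 0.7, 0.8, 0.2]  # unlabeled target sample
edges = [0.0, 0.5, 1.0]

weights = importance_weights(train_x, target_x, edges)
print(weights)
print(weighted_gini(train_y, [1.0] * len(train_y)), weighted_gini(train_y, weights))
```

Only the covariates of the target sample are used, which is what makes the setting semi-supervised. In the toy data the single training point in the upper bin is up-weighted, so the weighted impurity differs from the unweighted one.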
Cultivating Archipelago of Forests: Evolving Robust Decision Trees through Island Coevolution
Żychowski, Adam, Perrault, Andrew, Mańdziuk, Jacek
Decision trees are widely used in machine learning due to their simplicity and interpretability, but they often lack robustness to adversarial attacks and data perturbations. The paper proposes a novel island-based coevolutionary algorithm (ICoEvoRDF) for constructing robust decision tree ensembles. The algorithm operates on multiple islands, each containing populations of decision trees and adversarial perturbations. The populations on each island evolve independently, with periodic migration of top-performing decision trees between islands. This approach fosters diversity and enhances the exploration of the solution space, leading to more robust and accurate decision tree ensembles. ICoEvoRDF utilizes a popular game theory concept of mixed Nash equilibrium for ensemble weighting, which further leads to improvement in results. ICoEvoRDF is evaluated on 20 benchmark datasets, demonstrating its superior performance compared to state-of-the-art methods in optimizing both adversarial accuracy and minimax regret. The flexibility of ICoEvoRDF allows for the integration of decision trees from various existing methods, providing a unified framework for combining diverse solutions. Our approach offers a promising direction for developing robust and interpretable machine learning models.
Splitting criteria for ordinal decision trees: an experimental study
Ayllón-Gavilán, Rafael, Martínez-Estudillo, Francisco José, Guijo-Rubio, David, Hervás-Martínez, César, Gutiérrez, Pedro Antonio
Ordinal Classification (OC) is a machine learning field that addresses classification tasks where the labels exhibit a natural order. Unlike nominal classification, which treats all classes as equally distinct, OC takes the ordinal relationship into account, producing more accurate and relevant results. This is particularly critical in applications where the magnitude of classification errors has implications. Despite this, OC problems are often tackled using nominal methods, leading to suboptimal solutions. Although decision trees are one of the most popular classification approaches, ordinal tree-based approaches have received less attention when compared to other classifiers. This work conducts an experimental study of tree-based methodologies specifically designed to capture ordinal relationships. A comprehensive survey of ordinal splitting criteria is provided, standardising the notations used in the literature for clarity. Three ordinal splitting criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking Impurity (RI), are compared to the nominal counterparts of the first two (Gini and information gain), by incorporating them into a decision tree classifier. An extensive repository considering 45 publicly available OC datasets is presented, supporting the first experimental comparison of ordinal and nominal splitting criteria using well-known OC evaluation metrics. Statistical analysis of the results highlights OGini as the most effective ordinal splitting criterion to date. Source code, datasets, and results are made available to the research community.
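The difference between a nominal and an ordinal splitting criterion can be shown with a small sketch comparing the standard Gini impurity with an ordinal variant built from cumulative class proportions. The ordinal formula used here, the sum of F_k (1 - F_k) over the first K - 1 classes, is one formulation found in the literature; the exact definition and normalisation of OGini in the paper may differ, so treat this as an assumption. Labels are integers 0..K-1 in rank order.

```python
def class_proportions(labels, n_classes):
    """Fraction of samples in each class, indexed by ordinal rank."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]

def gini(labels, n_classes):
    """Nominal Gini impurity: ignores the order of the classes."""
    p = class_proportions(labels, n_classes)
    return 1.0 - sum(q * q for q in p)

def ordinal_gini(labels, n_classes):
    """Ordinal variant: sum of F_k * (1 - F_k) over cumulative proportions F_k."""
    p = class_proportions(labels, n_classes)
    cumulative = 0.0
    total = 0.0
    for q in p[:-1]:  # the last cumulative proportion is always 1
        cumulative += q
        total += cumulative * (1.0 - cumulative)
    return total

# Two nodes with three ordered classes (0 < 1 < 2).
adjacent = [0, 0, 1, 1]  # mixes neighbouring classes
distant = [0, 0, 2, 2]   # mixes the two extreme classes

print(gini(adjacent, 3), gini(distant, 3))
print(ordinal_gini(adjacent, 3), ordinal_gini(distant, 3))
```

The nominal criterion scores both nodes identically, while the ordinal variant penalises the node that mixes the two extreme classes more heavily. This is the kind of behaviour an ordinal splitting criterion is meant to capture.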
The Certainty Ratio $C_\rho$: a novel metric for assessing the reliability of classifier predictions
Evaluating the performance of classifiers is critical in machine learning, particularly in high-stakes applications where the reliability of predictions can significantly impact decision-making. Traditional performance measures, such as accuracy and F-score, often fail to account for the uncertainty inherent in classifier predictions, leading to potentially misleading assessments. This paper introduces the Certainty Ratio ($C_\rho$), a novel metric designed to quantify the contribution of confident (certain) versus uncertain predictions to any classification performance measure. By integrating the Probabilistic Confusion Matrix ($CM^\star$) and decomposing predictions into certainty and uncertainty components, $C_\rho$ provides a more comprehensive evaluation of classifier reliability. Experimental results across 21 datasets and multiple classifiers, including Decision Trees, Naive-Bayes, 3-Nearest Neighbors, and Random Forests, demonstrate that $C_\rho$ reveals critical insights that conventional metrics often overlook. These findings emphasize the importance of incorporating probabilistic information into classifier evaluation, offering a robust tool for researchers and practitioners seeking to improve model trustworthiness in complex environments.
Mastering AI: Big Data, Deep Learning, and the Evolution of Large Language Models -- AutoML from Basics to State-of-the-Art Techniques
Feng, Pohsun, Bi, Ziqian, Wen, Yizhu, Peng, Benji, Liu, Junyu, Yin, Caitlyn Heqi, Wang, Tianyang, Chen, Keyu, Zhang, Sen, Li, Ming, Xu, Jiawei, Liu, Ming, Pan, Xuanhe, Wang, Jinlang, Niu, Qian
In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have grown tremendously in popularity across various industries. From healthcare and finance to retail and automotive, adopting machine learning models has led to significant advancements [1]. However, building machine learning models traditionally requires deep knowledge in multiple areas, such as data preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation [2]. For many beginners and even experienced practitioners, this process can be time-consuming and technically challenging. This is where AutoML (Automated Machine Learning) comes in. AutoML simplifies the process of building machine learning models by automating many of the steps that would otherwise require manual intervention [3]. AutoML tools can automatically preprocess data, select the most suitable algorithms, and fine-tune hyperparameters to produce highly accurate models [4]. This automation not only speeds up the model development cycle but also allows users without deep knowledge of machine learning to create models with comparable performance to those made by experienced data scientists.
Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
Phelps, Nathan, Lizotte, Daniel J., Woolford, Douglas G.
Imbalanced binary classification problems arise in many fields of study. When using machine learning models for these problems, it is common to subsample the majority class (i.e., undersampling) to create a (more) balanced dataset for model training. This biases the model's predictions because the model learns from a dataset that does not follow the same data generating process as new data. One way of accounting for this bias is to analytically map the resulting predictions to new values based on the sampling rate for the majority class, which was used to create the training dataset. While this approach may work well for some machine learning models, we have found that calibrating a random forest this way has unintended negative consequences, including prevalence estimates that can be upwardly biased. These prevalence estimates depend on both i) the number of predictors considered at each split in the random forest; and ii) the sampling rate used. We explain the former using known properties of random forests and analytical calibration. However, in investigating the latter issue, we made a surprising discovery: contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
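The analytical mapping referred to above can be written down directly. Assuming the minority class is the positive class and a fraction beta of the majority class was kept when building the training set, undersampling multiplies the odds of the positive class by 1 / beta, so a prediction p_s from the undersampled model maps back to p = beta * p_s / (beta * p_s - p_s + 1). The sketch below applies this standard correction to a few made-up predictions; the function and variable names are hypothetical.

```python
def calibrate(p_s, beta):
    """Map a probability from a model trained on undersampled data back to the
    original scale. beta is the fraction of the majority class that was kept."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

def prevalence_estimate(predictions, beta):
    """Mean of the calibrated probabilities, used as a prevalence estimate."""
    calibrated = [calibrate(p, beta) for p in predictions]
    return sum(calibrated) / len(calibrated)

# Made-up predictions from a model trained with 10% of the majority class kept.
raw_predictions = [0.5, 0.2, 0.9, 0.1]
beta = 0.1
print([round(calibrate(p, beta), 4) for p in raw_predictions])
print(round(prevalence_estimate(raw_predictions, beta), 4))
```

A raw score of 0.5 maps to 1/11, roughly 0.09, which shows how strongly the correction shrinks predictions when the majority class was heavily undersampled. The abstract's point is that applying this mapping to random forest outputs can still leave prevalence estimates biased, so the formula is a starting point and not a guarantee of calibration.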
Integrating Evidence into the Design of XAI and AI-based Decision Support Systems: A Means-End Framework for End-users in Construction
Love, Peter E. D., Matthews, Jane, Fang, Weili, Mahamivanan, Hadi
A narrative review is used to develop a theoretical evidence-based means-end framework to build an epistemic foundation to uphold explainable artificial intelligence instruments so that the reliability of outcomes generated from decision support systems can be assured and better explained to end-users. The implications of adopting an evidence-based approach to designing decision support systems in construction are discussed with emphasis placed on evaluating the strength, value, and utility of evidence needed to develop meaningful human explanations for end-users. While the developed means-end framework is focused on end-users, stakeholders can also utilize it to create meaningful human explanations. However, they will vary due to their different epistemic goals. Including evidence in the design and development of explainable artificial intelligence and decision support systems will improve decision-making effectiveness, enabling end-users' epistemic goals to be achieved. The proposed means-end framework is developed from a broad spectrum of literature. Thus, it is suggested that it can be used in construction and other engineering domains where there is a need to integrate evidence into the design of explainable artificial intelligence and decision support systems.