Decision Tree Learning
A Central Limit Theorem for the permutation importance measure
Föge, Nico, Schmid, Lena, Ditzhaus, Marc, Pauly, Markus
Random Forests have become a widely used tool in machine learning since their introduction in 2001, known for their strong performance in classification and regression tasks. One key feature of Random Forests is the Random Forest Permutation Importance Measure (RFPIM), an internal, non-parametric measure of variable importance. While widely used, theoretical work on RFPIM is sparse, and most research has focused on empirical findings. However, recent progress has been made, such as establishing consistency of the RFPIM, although a mathematical analysis of its asymptotic distribution is still missing. In this paper, we provide a formal proof of a Central Limit Theorem for RFPIM using U-Statistics theory. Our approach deviates from the conventional Random Forest model by assuming a random number of trees and imposing conditions on the regression functions and error terms, which must be bounded and additive, respectively. Our result aims at improving the theoretical understanding of RFPIM rather than conducting comprehensive hypothesis testing. However, our contributions provide a solid foundation and demonstrate the potential for future work to extend to practical applications which we also highlight with a small simulation study.
AgroXAI: Explainable AI-Driven Crop Recommendation System for Agriculture 4.0
Turgut, Ozlem, Kok, Ibrahim, Ozdemir, Suat
Today, crop diversification in agriculture is a critical issue to meet the increasing demand for food and improve food safety and quality. This issue is considered to be the most important challenge for the next generation of agriculture due to the diminishing natural resources, the limited arable land, and unpredictable climatic conditions caused by climate change. In this paper, we employ emerging technologies such as the Internet of Things (IoT), machine learning (ML), and explainable artificial intelligence (XAI) to improve operational efficiency and productivity in the agricultural sector. Specifically, we propose an edge computing-based explainable crop recommendation system, AgroXAI, which suggests suitable crops for a region based on weather and soil conditions. In this system, we provide local and global explanations of ML model decisions with methods such as ELI5, LIME, SHAP, which we integrate into ML models. More importantly, we provide regional alternative crop recommendations with the counterfactual explainability method. In this way, we envision that our proposed AgroXAI system will be a platform that provides regional crop diversity in the next generation agriculture.
RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement
Lin, Junjie, Zhao, Jian, Liu, Lin, Deng, Yue, Zhao, Youpeng, Huang, Lanxiao, Lin, Xia, Zhou, Wengang, Li, Houqiang
Traditionally, AI development for two-player zero-sum games has relied on two primary techniques: decision trees and reinforcement learning (RL). A common approach involves using a fixed decision tree as one player's strategy while training an RL agent as the opponent to identify vulnerabilities in the decision tree, thereby improving its strategic strength iteratively. However, this process often requires significant human intervention to refine the decision tree after identifying its weaknesses, resulting in inefficiencies and hindering full automation of the strategy enhancement process. Fortunately, the advent of Large Language Models (LLMs) offers a transformative opportunity to automate the process. We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. Given an initial decision tree, the method involves two important iterative steps. Response Policy Search: RL is used to discover counter-strategies targeting the decision tree. Policy Improvement: LLMs analyze failure scenarios and generate improved decision tree code. In our method, RL focuses on finding the decision tree's flaws while LLM is prompted to generate an improved version of the decision tree. The iterative refinement process terminates when RL can't find any flaw of the tree or LLM fails to improve the tree. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game. After iterative refinements, our curling AI based on the decision tree ranks first on the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs can significantly enhance the robustness and adaptability of decision trees, representing a substantial advancement in the field of Game AI. Our code is available at https://github.com/Linjunjie99/RL-LLM-DT.
Evaluating the Efficacy of Vectocardiographic and ECG Parameters for Efficient Tertiary Cardiology Care Allocation Using Decision Tree Analysis
da Costa, Lucas José, Uemoto, Vinicius Ruiz, de Marchi, Mariana F. N., Hortegal, Renato de Aguiar, de Freitas, Renata Valeri
Use real word data to evaluate the performance of the electrocardiographic markers of GEH as features in a machine learning model with Standard ECG features and Risk Factors in Predicting Outcome of patients in a population referred to a tertiary cardiology hospital. Patients forwarded to specific evaluation in a cardiology specialized hospital performed an ECG and a risk factor anamnesis. A series of follow up attendances occurred in periods of 6 months, 12 months and 15 months to check for cardiovascular related events (mortality or new nonfatal cardiovascular events (Stroke, MI, PCI, CS), as identified during 1-year phone follow-ups. The first attendance ECG was measured by a specialist and processed in order to obtain the global electric heterogeneity (GEH) using the Kors Matriz. The ECG measurements, GEH parameters and risk factors were combined for training multiple instances of XGBoost decision trees models. Each instance were optmized for the AUCPR and the instance with higher AUC is chosen as representative to the model. The importance of each parameter for the winner tree model was compared to better understand the improvement from using GEH parameters. The GEH parameters turned out to have statistical significance for this population specially the QRST angle and the SVG. The combined model with the tree parameters class had the best performance. The findings suggest that using VCG features can facilitate more accurate identification of patients who require tertiary care, thereby optimizing resource allocation and improving patient outcomes. Moreover, the decision tree model's transparency and ability to pinpoint critical features make it a valuable tool for clinical decision-making and align well with existing clinical practices.
Omics-driven hybrid dynamic modeling of bioprocesses with uncertainty estimation
Espinel-Ríos, Sebastián, López, José Montaño, Avalos, José L.
This work presents an omics-driven modeling pipeline that integrates machine-learning tools to facilitate the dynamic modeling of multiscale biological systems. Random forests and permutation feature importance are proposed to mine omics datasets, guiding feature selection and dimensionality reduction for dynamic modeling. Continuous and differentiable machine-learning functions can be trained to link the reduced omics feature set to key components of the dynamic model, resulting in a hybrid model. As proof of concept, we apply this framework to a high-dimensional proteomics dataset of $\textit{Saccharomyces cerevisiae}$. After identifying key intracellular proteins that correlate with cell growth, targeted dynamic experiments are designed, and key model parameters are captured as functions of the selected proteins using Gaussian processes. This approach captures the dynamic behavior of yeast strains under varying proteome profiles while estimating the uncertainty in the hybrid model's predictions. The outlined modeling framework is adaptable to other scenarios, such as integrating additional layers of omics data for more advanced multiscale biological systems, or employing alternative machine-learning methods to handle larger datasets. Overall, this study outlines a strategy for leveraging omics data to inform multiscale dynamic modeling in systems biology and bioprocess engineering.
datadriftR: An R Package for Concept Drift Detection in Predictive Models
Predictive models often face performance degradation due to evolving data distributions, a phenomenon known as data drift. Among its forms, concept drift, where the relationship between explanatory variables and the response variable changes, is particularly challenging to detect and adapt to. Traditional drift detection methods often rely on metrics such as accuracy or variable distributions, which may fail to capture subtle but significant conceptual changes. This paper introduces drifter, an R package designed to detect concept drift, and proposes a novel method called Profile Drift Detection (PDD) that enables both drift detection and an enhanced understanding of the cause behind the drift by leveraging an explainable AI tool - Partial Dependence Profiles (PDPs). The PDD method, central to the package, quantifies changes in PDPs through novel metrics, ensuring sensitivity to shifts in the data stream without excessive computational costs. This approach aligns with MLOps practices, emphasizing model monitoring and adaptive retraining in dynamic environments. The experiments across synthetic and real-world datasets demonstrate that PDD outperforms existing methods by maintaining high accuracy while effectively balancing sensitivity and stability. The results highlight its capability to adaptively retrain models in dynamic environments, making it a robust tool for real-time applications. The paper concludes by discussing the advantages, limitations, and future extensions of the package for broader use cases.
DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces
Pettit, Jacob F., Lee, Chak Shing, Yang, Jiachen, Ho, Alex, Faissol, Daniel, Petersen, Brenden, Landajuela, Mikel
We consider the challenge of black-box optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as the complexity of the problem increases. In particular, we illustrate DisCo-DSO's superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.
Extracting PAC Decision Trees from Black Box Binary Classifiers: The Gender Bias Study Case on BERT-based Language Models
Ozaki, Ana, Confalonieri, Roberto, Guimarães, Ricardo, Imenes, Anders
Decision trees are a popular machine learning method, known for their inherent explainability. In Explainable AI, decision trees can be used as surrogate models for complex black box AI models or as approximations of parts of such models. A key challenge of this approach is determining how accurately the extracted decision tree represents the original model and to what extent it can be trusted as an approximation of their behavior. In this work, we investigate the use of the Probably Approximately Correct (PAC) framework to provide a theoretical guarantee of fidelity for decision trees extracted from AI models. Based on theoretical results from the PAC framework, we adapt a decision tree algorithm to ensure a PAC guarantee under certain conditions. We focus on binary classification and conduct experiments where we extract decision trees from BERT-based language models with PAC guarantees. Our results indicate occupational gender bias in these models.
Class flipping for uplift modeling and Heterogeneous Treatment Effect estimation on imbalanced RCT data
Rudaś, Krzysztof, Jaroszewicz, Szymon
In this paper, we focus on data from Randomized Controlled Experiments which guarantee causal interpretation of the outcomes. Class and treatment imbalance are important problems in uplift modeling/HTE, but classical undersampling or oversampling based approaches are hard to apply in this case since they distort the predicted effect. Calibration methods have been proposed in the past, however, they do not guarantee correct predictions. In this work, we propose an approach alternative to undersampling, based on flipping the class value of selected records. We show that the proposed approach does not distort the predicted effect and does not require calibration. The method is especially useful for models based on class variable transformation (modified outcome models). We address those models separately, designing a transformation scheme which guarantees correct predictions and addresses also the problem of treatment imbalance which is especially important for those models. Experiments fully confirm our theoretical results. Additionally, we demonstrate that our method is a viable alternative also for standard classification problems.
A Brief Discussion on KPI Development in Public Administration
Fioretto, Simona, Masciari, Elio, Napolitano, Enea Vincenzo
Efficient and effective service delivery in Public Administration (PA) relies on the development and utilization of key performance indicators (KPIs) for evaluating and measuring performance. This paper presents an innovative framework for KPI construction within performance evaluation systems, leveraging Random Forest algorithms and variable importance analysis. The proposed approach identifies key variables that significantly influence PA performance, offering valuable insights into the critical factors driving organizational success. By integrating variable importance analysis with expert consultation, relevant KPIs can be systematically developed, ensuring that improvement strategies address performance-critical areas. The framework incorporates continuous monitoring mechanisms and adaptive phases to refine KPIs in response to evolving administrative needs. This study aims to enhance PA performance through the application of machine learning techniques, fostering a more agile and results-driven approach to public administration.