Goto

Collaborating Authors

 Decision Tree Learning


Optimal Classification Trees for Continuous Feature Data Using Dynamic Programming with Branch-and-Bound

arXiv.org Artificial Intelligence

Computing an optimal classification tree that provably maximizes training performance within a given size limit, is NP-hard, and in practice, most state-of-the-art methods do not scale beyond computing optimal trees of depth three. Therefore, most methods rely on a coarse binarization of continuous features to maintain scalability. We propose a novel algorithm that optimizes trees directly on the continuous feature data using dynamic programming with branch-and-bound. We develop new pruning techniques that eliminate many sub-optimal splits in the search when similar to previously computed splits and we provide an efficient subroutine for computing optimal depth-two trees. Our experiments demonstrate that these techniques improve runtime by one or more orders of magnitude over state-of-the-art optimal methods and improve test accuracy by 5% over greedy heuristics.


Learning Hyperplane Tree: A Piecewise Linear and Fully Interpretable Decision-making Framework

arXiv.org Artificial Intelligence

This paper introduces a novel tree-based model, Learning Hyperplane Tree (LHT), which outperforms state-of-the-art (SOTA) tree models for classification tasks on several public datasets. The structure of LHT is simple and efficient: it partitions the data using several hyperplanes to progressively distinguish between target and non-target class samples. Although the separation is not perfect at each stage, LHT effectively improves the distinction through successive partitions. During testing, a sample is classified by evaluating the hyperplanes defined in the branching blocks and traversing down the tree until it reaches the corresponding leaf block. The class of the test sample is then determined using the piecewise linear membership function defined in the leaf blocks, which is derived through least-squares fitting and fuzzy logic. LHT is highly transparent and interpretable--at each branching block, the contribution of each feature to the classification can be clearly observed.


Privacy-Preserving Model and Preprocessing Verification for Machine Learning

arXiv.org Artificial Intelligence

This paper presents a framework for privacy-preserving verification of machine learning models, focusing on models trained on sensitive data. Integrating Local Differential Privacy (LDP) with model explanations from LIME and SHAP, our framework enables robust verification without compromising individual privacy. It addresses two key tasks: binary classification, to verify if a target model was trained correctly by applying the appropriate preprocessing steps, and multi-class classification, to identify specific preprocessing errors. Evaluations on three real-world datasets-Diabetes, Adult, and Student Record-demonstrate that while the ML-based approach is particularly effective in binary tasks, the threshold-based method performs comparably in multi-class tasks. Results indicate that although verification accuracy varies across datasets and noise levels, the framework provides effective detection of preprocessing errors, strong privacy guarantees, and practical applicability for safeguarding sensitive data.


Quantized Training of Gradient Boosting Decision Trees

Neural Information Processing Systems

Recent years have witnessed significant success in Gradient Boosting Decision Trees (GBDT) for a wide range of machine learning applications. Generally, a consensus about GBDT's training algorithms is gradients and statistics are computed based on high-precision floating points. In this paper, we investigate an essentially important question which has been largely ignored by the previous literature - how many bits are needed for representing gradients in training GBDT? To solve this mystery, we propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT's training algorithm. Surprisingly, both our theoretical analysis and empirical studies show that the necessary precisions of gradients without hurting any performance can be quite low, e.g., 2 or 3 bits.


Soft regression trees: a model variant and a decomposition training algorithm

arXiv.org Artificial Intelligence

Decision trees are widely used for classification and regression tasks in a variety of application fields due to their interpretability and good accuracy. During the past decade, growing attention has been devoted to globally optimized decision trees with deterministic or soft splitting rules at branch nodes, which are trained by optimizing the error function over all the tree parameters. In this work, we propose a new variant of soft multivariate regression trees (SRTs) where, for every input vector, the prediction is defined as the linear regression associated to a single leaf node, namely, the leaf node obtained by routing the input vector from the root along the branches with higher probability. SRTs exhibit the conditional computational property, i.e., each prediction depends on a small number of nodes (parameters), and our nonlinear optimization formulation for training them is amenable to decomposition. After showing a universal approximation result for SRTs, we present a decomposition training algorithm including a clustering-based initialization procedure and a heuristic for reassigning the input vectors along the tree. Under mild assumptions, we establish asymptotic convergence guarantees. Experiments on 15 wellknown datasets indicate that our SRTs and decomposition algorithm yield higher accuracy and robustness compared with traditional soft regression trees trained using the nonlinear optimization formulation of Blanquero et al., and a significant reduction in training times as well as a slightly better average accuracy compared with the mixed-integer optimization approach of Bertsimas and Dunn. We also report a comparison with the Random Forest ensemble method.


Mathematical Modeling and Machine Learning for Predicting Shade-Seeking Behavior in Cows Under Heat Stress

arXiv.org Artificial Intelligence

In this paper we develop a mathematical model combined with machine learning techniques to predict shade-seeking behavior in cows exposed to heat stress. The approach integrates advanced mathematical features, such as time-averaged thermal indices and accumulated heat stress metrics, obtained by mathematical analysis of data from a farm in Titaguas (Valencia, Spain), collected during the summer of 2023. Two predictive models, Random Forests and Neural Networks, are compared for accuracy, robustness, and interpretability. The Random Forest model is highlighted for its balance between precision and explainability, achieving an RMSE of $14.97$. The methodology also employs $5-$fold cross-validation to ensure robustness under real-world conditions. This work not only advances the mathematical modeling of animal behavior but also provides useful insights for mitigating heat stress in livestock through data-driven tools.


Towards understanding the bias in decision trees

arXiv.org Machine Learning

There is a widespread and longstanding belief that machine learning models are biased towards the majority (or negative) class when learning from imbalanced data, leading them to neglect or ignore the minority (or positive) class. In this study, we show that this belief is not necessarily correct for decision trees, and that their bias can actually be in the opposite direction. Motivated by a recent simulation study that suggested that decision trees can be biased towards the minority class, our paper aims to reconcile the conflict between that study and decades of other works. First, we critically evaluate past literature on this problem, finding that failing to consider the data generating process has led to incorrect conclusions about the bias in decision trees. We then prove that, under specific conditions related to the predictors, decision trees fit to purity and trained on a dataset with only one positive case are biased towards the minority class. Finally, we demonstrate that splits in a decision tree are also biased when there is more than one positive case. Our findings have implications on the use of popular tree-based models, such as random forests.


Machine learning applications in archaeological practices: a review

arXiv.org Artificial Intelligence

Artificial intelligence and machine learning applications in archaeology have increased significantly in recent years, and these now span all subfields, geographical regions, and time periods. The prevalence and success of these applications have remained largely unexamined, as recent reviews on the use of machine learning in archaeology have only focused only on specific subfields of archaeology. Our review examined an exhaustive corpus of 135 articles published between 1997 and 2022. We observed a significant increase in the number of relevant publications from 2019 onwards. Automatic structure detection and artefact classification were the most represented tasks in the articles reviewed, followed by taphonomy, and archaeological predictive modelling. From the review, clustering and unsupervised methods were underrepresented compared to supervised models. Artificial neural networks and ensemble learning account for two thirds of the total number of models used. However, if machine learning is gaining in popularity it remains subject to misunderstanding. We observed, in some cases, poorly defined requirements and caveats of the machine learning methods used. Furthermore, the goals and the needs of machine learning applications for archaeological purposes are in some cases unclear or poorly expressed. To address this, we proposed a workflow guide for archaeologists to develop coherent and consistent methodologies adapted to their research questions, project scale and data. As in many other areas, machine learning is rapidly becoming an important tool in archaeological research and practice, useful for the analyses of large and multivariate data, although not without limitations. This review highlights the importance of well-defined and well-reported structured methodologies and collaborative practices to maximise the potential of applications of machine learning methods in archaeology.


Can Explainable AI Assess Personalized Health Risks from Indoor Air Pollution?

arXiv.org Artificial Intelligence

Acknowledging the effects of outdoor air pollution, the literature inadequately addresses indoor air pollution's impacts. Despite daily health risks, existing research primarily focused on monitoring, lacking accuracy in pinpointing indoor pollution sources. In our research work, we thoroughly investigated the influence of indoor activities on pollution levels. A survey of 143 participants revealed limited awareness of indoor air pollution. Leveraging 65 days of diverse data encompassing activities like incense stick usage, indoor smoking, inadequately ventilated cooking, excessive AC usage, and accidental paper burning, we developed a comprehensive monitoring system. We identify pollutant sources and effects with high precision through clustering analysis and interpretability models (LIME and SHAP). Our method integrates Decision Trees, Random Forest, Naive Bayes, and SVM models, excelling at 99.8% accuracy with Decision Trees. Continuous 24-hour data allows personalized assessments for targeted pollution reduction strategies, achieving 91% accuracy in predicting activities and pollution exposure.


Strategic Fusion Optimizes Transformer Compression

arXiv.org Artificial Intelligence

This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention. To address the limitations of single-signal strategies, we introduced two fusion strategies, linear regression and random forest, which combine individual strategies (i.e., strategic fusion), for more informed pruning decisions. Additionally, we applied knowledge distillation to mitigate any accuracy loss during layer pruning. Our results reveal that random forest strategic fusion outperforms individual strategies in seven out of nine datasets and achieves near-optimal performance in the other two. The distilled random forest surpasses the original accuracy in six datasets and mitigates accuracy drops in the remaining three. Knowledge distillation also improves the accuracy-to-size ratio by an average factor of 18.84 across all datasets. Supported by mathematical foundations and biological analogies, our findings suggest that strategically combining multiple signals can lead to efficient, high-performing transformer models for resource-constrained applications.