Decision Tree Learning
Supplementary Material for Classification with Valid and Adaptive Coverage
Yaniv Romano
Here, we consider the jackknife+, i.e., ... Algorithm S1 describes the extension of Algorithm 1 discussed in Section 2.5, which ensures ... The validity of this algorithm is established by the following result. We begin by proving the lower bound on coverage. This will become apparent after we reduce our claim to the setting in the aforementioned paper. This is easy to verify. Let σ(1), ..., σ(n + m) be the permutation of the data points corresponding to Σ, so that (ΣA Σ ... S3.1 Implementation details: We have applied the following black-box classification methods to estimate label probabilities: ... JK+ is omitted for computational reasons. The performances of the different methods on data generated from this model are compared in Figure S3.
Interpretable Machine Learning for Life Expectancy Prediction: A Comparative Study of Linear Regression, Decision Tree, and Random Forest
Dolgopolyi, Roman, Amaslidou, Ioanna, Margaritou, Agrippina
Life expectancy is a fundamental indicator of population health and socio-economic well-being, yet accurately forecasting it remains challenging due to the interplay of demographic, environmental, and healthcare factors. This study evaluates three machine learning models, Linear Regression (LR), Regression Decision Tree (RDT), and Random Forest (RF), using a real-world dataset drawn from World Health Organization (WHO) and United Nations (UN) sources. After extensive preprocessing to address missing values and inconsistencies, each model's performance was assessed with R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Results show that RF achieves the highest predictive accuracy (R² = 0.9423), significantly outperforming LR and RDT. Interpretability was prioritized through p-values for LR and feature-importance metrics for the tree-based models, revealing immunization rates (diphtheria, measles) and demographic attributes (HIV/AIDS, adult mortality) as critical drivers of life-expectancy predictions. These insights underscore the synergy between ensemble methods and transparency in addressing public-health challenges. Future research should explore advanced imputation strategies, alternative algorithms (e.g., neural networks), and updated data to further refine predictive accuracy and support evidence-based policymaking in global health contexts.
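The three evaluation metrics named in this abstract can be written out directly. The following is a minimal sketch using only the Python standard library; the function name `regression_metrics` and the sample values are hypothetical and are not taken from the study or its dataset.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for paired lists of observed and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Hypothetical life-expectancy values in years, for illustration only.
observed = [70.0, 65.0, 80.0, 75.0]
predicted = [72.0, 64.0, 79.0, 75.0]
r2, mae, rmse = regression_metrics(observed, predicted)
```

R² compares the model's squared error with that of always predicting the mean, so it is unitless, while MAE and RMSE are both expressed in the units of the target (here, years). RMSE penalizes large individual errors more heavily than MAE.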
Mondrian Forests: Efficient Online Random Forests
Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman's random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterpart to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.
Localized Uncertainty Quantification in Random Forests via Proximities
Rhodes, Jake S., Brown, Scott D., Wilkinson, J. Riley
In machine learning, uncertainty quantification helps assess the reliability of model predictions, which is important in high-stakes scenarios. Traditional approaches often emphasize predictive accuracy, but there is a growing focus on incorporating uncertainty measures. While current methods often rely on quantile regression or Monte Carlo techniques, we propose a new approach using naturally occurring test sets and similarity measures (proximities) typically viewed as byproducts of random forests. Specifically, we form localized distributions of OOB errors around nearby points, defined using the proximities, to create prediction intervals for regression and trust scores for classification. By varying the number of nearby points, our intervals can be adjusted to achieve the desired coverage while retaining the flexibility that reflects the certainty of individual predictions. For classification, excluding points identified as unclassifiable by our method generally enhances the accuracy of the model and provides higher accuracy-rejection AUC scores than competing methods. Although traditional machine learning models usually provide point estimates, there is growing recognition of the need to incorporate uncertainty to support more informed decisions [1]. By quantifying uncertainty, users can assess the reliability of model outputs and better interpret results, especially for out-of-distribution samples through calibrated confidence estimates.
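The idea of building an interval from out-of-bag (OOB) errors of nearby points can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes the classic proximity definition (the fraction of trees in which two points share a leaf), assumes that leaf assignments and OOB residuals have already been computed by some random forest, and uses hypothetical names (`proximity`, `local_interval`) and toy numbers throughout.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two points land in the same leaf (assumed definition)."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)

def local_interval(test_leaves, test_pred, train_leaves, oob_residuals, k, alpha=0.1):
    """Interval from the OOB residuals of the k training points most proximal to the test point."""
    # Rank training points by proximity to the test point, highest first.
    ranked = sorted(
        range(len(train_leaves)),
        key=lambda i: proximity(test_leaves, train_leaves[i]),
        reverse=True,
    )
    local = sorted(oob_residuals[i] for i in ranked[:k])
    # Symmetric empirical quantiles of the local residuals.
    lo_idx = int((alpha / 2) * (len(local) - 1))
    hi_idx = (len(local) - 1) - lo_idx
    return test_pred + local[lo_idx], test_pred + local[hi_idx]

# Toy inputs: leaf index of each training point in each of 3 trees (hypothetical).
train_leaves = [[1, 4, 2], [1, 4, 3], [1, 5, 2], [7, 8, 9], [7, 8, 6]]
# OOB residuals (observed minus OOB prediction) for the same training points.
oob_residuals = [-0.5, 0.2, 0.4, 3.0, -2.5]
lower, upper = local_interval([1, 4, 2], 10.0, train_leaves, oob_residuals, k=3)
```

In this toy case the three most proximal training points have small residuals, so the interval stays narrow even though two distant points have large errors. Increasing `k` pulls in less similar points, which is the knob the abstract describes for adjusting coverage.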
SHAPoint: Task-Agnostic, Efficient, and Interpretable Point-Based Risk Scoring via Shapley Values
Meirman, Tomer D., Shapira, Bracha, Dagan, Noa, Rokach, Lior S.
Interpretable risk scores play a vital role in clinical decision support, yet traditional methods for deriving such scores often rely on manual preprocessing, task-specific modeling, and simplified assumptions that limit their flexibility and predictive power. We present SHAPoint, a novel, task-agnostic framework that integrates the predictive accuracy of gradient boosted trees with the interpretability of point-based risk scores. SHAPoint supports classification, regression, and survival tasks, while also inheriting valuable properties from tree-based models, such as native handling of missing data and support for monotonic constraints. Compared to existing frameworks, SHAPoint offers superior flexibility, reduced reliance on manual preprocessing, and faster runtime performance. Empirical results show that SHAPoint produces compact and interpretable scores with predictive performance comparable to state-of-the-art methods, but at a fraction of the runtime, making it a powerful tool for transparent and scalable risk stratification.