regression forest
Assessing Surrogate Heterogeneity in Real World Data Using Meta-Learners
Knowlton, Rebecca, Parast, Layla
Surrogate markers are most commonly studied within the context of randomized clinical trials. However, the need for alternative outcomes extends beyond these settings and may be more pronounced in real-world public health and social science research, where randomized trials are often impractical. Research on identifying surrogates in real-world non-randomized data is scarce, as available statistical approaches for evaluating surrogate markers tend to rely on the assumption that treatment is randomized. While the few methods that allow for non-randomized treatment/exposure appropriately handle confounding individual characteristics, they do not offer a way to examine surrogate heterogeneity with respect to patient characteristics. In this paper, we propose a framework to assess surrogate heterogeneity in real-world, i.e., non-randomized, data and implement this framework using various meta-learners. Our approach allows us to quantify heterogeneity in surrogate strength with respect to patient characteristics while accommodating confounders through the use of flexible, off-the-shelf machine learning methods. In addition, we use our framework to identify individuals for whom the surrogate is a valid replacement of the primary outcome. We examine the performance of our methods via a simulation study and application to examine heterogeneity in the surrogacy of hemoglobin A1c as a surrogate for fasting plasma glucose.
Uncertainty estimation in satellite precipitation spatial prediction by combining distributional regression algorithms
Papacharalampous, Georgia, Tyralis, Hristos, Doulamis, Nikolaos, Doulamis, Anastasios
To facilitate effective decision-making, gridded satellite precipitation products should include uncertainty estimates. Machine learning has been proposed for issuing such estimates. However, most existing algorithms for this purpose rely on quantile regression. Distributional regression offers distinct advantages over quantile regression, including the ability to model intermittency as well as a stronger ability to extrapolate beyond the training data, which is critical for predicting extreme precipitation. In this work, we introduce the concept of distributional regression for the engineering task of creating precipitation datasets through data merging. Building upon this concept, we propose new ensemble learning methods that can be valuable not only for spatial prediction but also for prediction problems in general. These methods exploit conditional zero-adjusted probability distributions estimated with generalized additive models for location, scale, and shape (GAMLSS), spline-based GAMLSS and distributional regression forests as well as their ensembles (stacking based on quantile regression, and equal-weight averaging). To identify the most effective methods for our specific problem, we compared them to benchmarks using a large, multi-source precipitation dataset. Stacking emerged as the most successful strategy. Three specific stacking methods achieved the best performance based on the quantile scoring rule, although the ranking of these methods varied across quantile levels. This suggests that a task-specific combination of multiple algorithms could yield significant benefits.
Stacking for Probabilistic Short-term Load Forecasting
In this study, we delve into the realm of meta-learning to combine point base forecasts for probabilistic short-term electricity demand forecasting. Our approach encompasses the utilization of quantile linear regression, quantile regression forest, and post-processing techniques involving residual simulation to generate quantile forecasts. Furthermore, we introduce both global and local variants of meta-learning. In the local-learning mode, the meta-model is trained using patterns most similar to the query pattern. Through extensive experimental studies across 35 forecasting scenarios and employing 16 base forecasting models, our findings underscored the superiority of quantile regression forest over its competitors.
Efficient Normalized Conformal Prediction and Uncertainty Quantification for Anti-Cancer Drug Sensitivity Prediction with Deep Regression Forests
Nolte, Daniel, Ghosh, Souparno, Pal, Ranadip
Deep learning models are being adopted and applied on various critical decision-making tasks, yet they are trained to provide point predictions without providing degrees of confidence. The trustworthiness of deep learning models can be increased if paired with uncertainty estimations. Conformal Prediction has emerged as a promising method to pair machine learning models with prediction intervals, allowing for a view of the model's uncertainty. However, popular uncertainty estimation methods for conformal prediction fail to provide heteroskedastic intervals that are equally accurate for all samples. In this paper, we propose a method to estimate the uncertainty of each sample by calculating the variance obtained from a Deep Regression Forest. We show that the deep regression forest variance improves the efficiency and coverage of normalized inductive conformal prediction on a drug response prediction task.
Example-Based Explanations of Random Forest Predictions
A random forest prediction can be computed by the scalar product of the labels of the training examples and a set of weights that are determined by the leafs of the forest into which the test object falls; each prediction can hence be explained exactly by the set of training examples for which the weights are non-zero. The number of examples used in such explanations is shown to vary with the dimensionality of the training set and hyperparameters of the random forest algorithm. This means that the number of examples involved in each prediction can to some extent be controlled by varying these parameters. However, for settings that lead to a required predictive performance, the number of examples involved in each prediction may be unreasonably large, preventing the user to grasp the explanations. In order to provide more useful explanations, a modified prediction procedure is proposed, which includes only the top-weighted examples. An investigation on regression and classification tasks shows that the number of examples used in each explanation can be substantially reduced while maintaining, or even improving, predictive performance compared to the standard prediction procedure.
Statistical post-processing of wind speed forecasts using convolutional neural networks
Veldkamp, Simon, Whan, Kirien, Dirksen, Sjoerd, Schmeits, Maurice
Current statistical post-processing methods for probabilistic weather forecasting are not capable of using full spatial patterns from the numerical weather prediction (NWP) model. In this paper we incorporate spatial wind speed information by using convolutional neural networks (CNNs) and obtain probabilistic wind speed forecasts in the Netherlands for 48 hours ahead, based on KNMI's Harmonie-Arome NWP model. The CNNs are shown to have higher Brier skill scores for medium to higher wind speeds, as well as a better continuous ranked probability score (CRPS), than fully connected neural networks and quantile regression forests.
Sufficient Representations for Categorical Variables
Johannemann, Jonathan, Hadad, Vitor, Athey, Susan, Wager, Stefan
Many learning algorithms require categorical data to be transformed into real vectors before it can be used as input. Often, categorical variables are encoded as one-hot (or dummy) vectors. However, this mode of representation can be wasteful since it adds many low-signal regressors, especially when the number of unique categories is large. In this paper, we investigate simple alternative solutions for universally consistent estimators that rely on lower-dimensional real-valued representations of categorical variables that are "sufficient" in the sense that no predictive information is lost. We then compare preexisting and proposed methods on simulated and observational datasets.
XGBoostLSS -- An extension of XGBoost to probabilistic forecasting
We propose a new framework of XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution, i.e., mean, location, scale and shape (LSS), instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distribution, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows to gain additional insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. We present both a simulation study and real world examples that demonstrate the benefits of our approach.
Quantitative Error Prediction of Medical Image Registration using Regression Forests
Sokooti, Hessam, Saygili, Gorkem, Glocker, Ben, Lelieveldt, Boudewijn P. F., Staring, Marius
Predicting registration error can be useful for evaluation of registration procedures, which is important for the adoption of registration techniques in the clinic. In addition, quantitative error prediction can be helpful in improving the registration quality. The task of predicting registration error is demanding due to the lack of a ground truth in medical images. This paper proposes a new automatic method to predict the registration error in a quantitative manner, and is applied to chest CT scans. A random regression forest is utilized to predict the registration error locally. The forest is built with features related to the transformation model and features related to the dissimilarity after registration. The forest is trained and tested using manually annotated corresponding points between pairs of chest CT scans in two experiments: SPREAD (trained and tested on SPREAD) and inter-database (including three databases SPREAD, DIR-Lab-4DCT and DIR-Lab-COPDgene). The results show that the mean absolute errors of regression are 1.07 $\pm$ 1.86 and 1.76 $\pm$ 2.59 mm for the SPREAD and inter-database experiment, respectively. The overall accuracy of classification in three classes (correct, poor and wrong registration) is 90.7% and 75.4%, for SPREAD and inter-database respectively. The good performance of the proposed method enables important applications such as automatic quality control in large-scale image analysis.
Deep Distribution Regression
Li, Rui, Bondell, Howard D., Reich, Brian J.
In recent years, a variety of machine learning methods, such as random forest, gradient boosting trees and neural networks have gained popularity and been widely adopted. These methods are often flexible enough to uncover complex relationships in high-dimensional data without strong assumptions on the underlying data structure. Off-the-shelf software is available to put these algorithms into production [Pedregosa et al. (2011), Abadi et al. (2016) and Paszke et al. (2017)]. However, in regression and forecasting tasks, many of the machine learning methods only provide a point estimate, without any additional information regarding the uncertainty of the target quantity. Understanding uncertainties are often crucial in fields such as financial markets and risk analysis [Diebold et al. (1997), Timmermann (2000)], population and demographic studies [Wilson and Bell (2007)], transportation and traffic analysis [Zhu and Laptev (2017), Rodrigues and Pereira (2018)] and energy forecasting [Hong et al. (2016)].