Continuous outcome


Random Forests as Statistical Procedures: Design, Variance, and Dependence

O'Connell, Nathaniel S.

arXiv.org Machine Learning

We develop a finite-sample, design-based theory for random forests in which each tree is a randomized conditional predictor acting on fixed covariates and the forest is their Monte Carlo average. An exact variance identity separates Monte Carlo error from a covariance floor that persists under infinite aggregation. The floor arises through two mechanisms: observation reuse, where the same training outcomes receive weight across multiple trees, and partition alignment, where independently generated trees discover similar conditional prediction rules. We prove the floor is strictly positive under minimal conditions and show that alignment persists even when sample splitting eliminates observation overlap entirely. We introduce procedure-aligned synthetic resampling (PASR) to estimate the covariance floor, decomposing the total prediction uncertainty of a deployed forest into interpretable components. For continuous outcomes, resulting prediction intervals achieve nominal coverage with a theoretically guaranteed conservative bias direction. For classification forests, the PASR estimator is asymptotically unbiased, providing the first pointwise confidence intervals for predicted conditional probabilities from a deployed forest. Nominal coverage is maintained across a range of design configurations for both outcome types, including high-dimensional settings. The underlying theory extends to any tree-based ensemble with an exchangeable tree-generating mechanism.
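The variance identity can be illustrated with a toy simulation. This is a sketch of the decomposition only, not the paper's PASR estimator: the shared random component stands in for the covariance floor (observation reuse and partition alignment), and the independent per-tree noise stands in for tree-level randomization.

```python
import numpy as np

rng = np.random.default_rng(0)

def forest_prediction(B, floor_sd=1.0, mc_sd=2.0):
    # Toy model of B exchangeable trees at a fixed query point:
    # one shared component (the covariance floor) plus B independent
    # idiosyncratic components (tree-level Monte Carlo randomization).
    shared = rng.normal(0.0, floor_sd)
    idiosyncratic = rng.normal(0.0, mc_sd, size=B)
    return (shared + idiosyncratic).mean()

def forest_variance(B, reps=20000):
    # Empirical variance of the B-tree forest prediction.
    return np.var([forest_prediction(B) for _ in range(reps)])

# Var(forest) = floor_sd**2 + mc_sd**2 / B: the Monte Carlo term
# vanishes as B grows, the covariance floor does not.
for B in (1, 10, 1000):
    print(B, round(forest_variance(B), 3))
```

With the constants above the limit of the forest variance as B grows is the floor `floor_sd**2 = 1`, not zero, matching the claim that the floor persists under infinite aggregation.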


Partial Counterfactual Identification of Continuous Outcomes with a Curvature Sensitivity Model

Neural Information Processing Systems

Counterfactual inference aims to answer retrospective "what if" questions and thus belongs to the most fine-grained type of inference in Pearl's causality ladder. Existing methods for counterfactual inference with continuous outcomes aim at point identification and thus make strong and unnatural assumptions about the underlying structural causal model. In this paper, we relax these assumptions and aim at partial counterfactual identification of continuous outcomes, i.e., when the counterfactual query resides in an ignorance interval with informative bounds. We prove that, in general, the ignorance interval of the counterfactual queries has non-informative bounds, already when functions of structural causal models are continuously differentiable. As a remedy, we propose a novel sensitivity model called Curvature Sensitivity Model.


rmlnomogram: An R package to construct an explainable nomogram for any machine learning algorithms

Sufriyana, Herdiantri, Su, Emily Chia-Yu

arXiv.org Machine Learning

Background: Currently, nomograms can only be created for regression algorithms. Providing nomograms for any machine learning (ML) algorithm may accelerate model deployment in clinical settings and improve model availability. We developed an R package and web application to construct nomograms, with model explainability, for any ML algorithm. Methods: We formulated a function to transform an ML prediction model into a nomogram, requiring datasets with: (1) all possible combinations of predictor values; (2) the corresponding outputs of the model; and (3) the corresponding explainability values for each predictor (optional). A web application was also created. Results: Our R package can create five types of nomograms: (1) categorical predictors and binary outcome without probability; (2) categorical predictors and binary outcome with probability; (3) categorical predictors and continuous outcome; (4) categorical and single numerical predictors and binary outcome with probability; and (5) categorical and single numerical predictors and continuous outcome. The first type allows a maximum of 15 predictors; the remaining types allow a maximum of 5 predictors, with a maximum of 3,200 combinations in either case. The web application enforces the same limits. Explainability values are supported for nomogram types 2 to 5. Conclusions: Our R package and web application can construct nomograms, with model explainability, for any ML algorithm using a fair number of predictors.
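The first dataset the method requires, every combination of predictor values paired with the model's output, can be sketched as below. This is an illustration of the idea only, not the rmlnomogram API; the function and variable names are mine.

```python
import itertools

def prediction_table(levels, predict):
    """Build the dataset a model-agnostic nomogram needs: every
    combination of categorical predictor values paired with the
    fitted model's output.
    levels: dict mapping predictor name -> list of possible values.
    predict: any fitted model's prediction function (a black box)."""
    names = list(levels)
    table = []
    for combo in itertools.product(*(levels[n] for n in names)):
        row = dict(zip(names, combo))
        table.append((row, predict(row)))
    return table

# Toy black-box model: an additive risk score (hypothetical example).
score = lambda row: 0.3 * (row["smoker"] == "yes") + 0.2 * (row["stage"] / 3)
table = prediction_table({"smoker": ["no", "yes"], "stage": [1, 2, 3]}, score)
print(len(table))  # 2 * 3 = 6 combinations
```

The combinatorial growth of this table is why the package caps the number of predictors and combinations.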


d64a340bcb633f536d56e51874281454-Reviews.html

Neural Information Processing Systems

Summary of the paper This paper is concerned with the support recovery problem in linear regression in the high-dimensional setup, that is to say, recovery of the non-null entries in the vector of regression parameters when the number of predictors p exceeds the sample size n. A simple greedy algorithm is proposed, particularly suitable in the presence of high correlation between predictors: starting from an initial guess S of the true support, it swaps each of the variables in S with each of the variables in S^c, one at a time, looking for an improvement in the squared loss. Such a procedure is called SWAP, and is typically used to enhance the performance of classical sparse recovery algorithms such as the LASSO, by using the latter as the initial guess for S. A theoretical analysis describing the limitations and guarantees of SWAP is presented in detail: conditions for accurate support recovery and bounds on the number of iterations required are provided. It is shown that the required assumptions are milder than the usual irrepresentability condition. Comments The paper is clearly written and pleasant to read.
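The procedure the review describes can be sketched in a few lines. This is a first-improvement reading of SWAP under my own naming, not the authors' reference implementation; in practice the initial support S would come from e.g. the LASSO.

```python
import numpy as np

def sq_loss(X, y, S):
    # Least-squares loss restricted to the candidate support S.
    beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    resid = y - X[:, S] @ beta
    return resid @ resid

def swap_step(X, y, S):
    # Try swapping each variable in S with each variable in S^c;
    # return the first swap that lowers the squared loss, else None.
    base = sq_loss(X, y, S)
    for pos in range(len(S)):
        for j in range(X.shape[1]):
            if j in S:
                continue
            cand = list(S)
            cand[pos] = j
            if sq_loss(X, y, cand) < base - 1e-10:
                return cand
    return None

def swap(X, y, S):
    # Iterate swap steps until no single swap improves the fit,
    # starting from an initial guess (e.g. the LASSO support).
    S = list(S)
    while (nxt := swap_step(X, y, S)) is not None:
        S = nxt
    return sorted(S)
```

Each accepted swap strictly decreases the loss over a finite set of supports, so the loop terminates.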


Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction

Safo, Sandra E., Lu, Han

arXiv.org Machine Learning

We develop scalable randomized kernel methods for jointly associating data from multiple sources and simultaneously predicting an outcome or classifying a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data together with predicting a clinical outcome and are capable of identifying variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view, and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. Through simulation studies, we show that the proposed methods outperform several other linear and nonlinear methods for multiview data integration. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Results from our real data application and simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in Pytorch and interfaced in R, and will be made available at: https://github.com/lasandrall/RandMVLearn.
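The random Fourier feature idea the abstract relies on can be sketched for the Gaussian kernel, a standard shift-invariant example. The function names are mine, not RandMVLearn's API.

```python
import numpy as np

def random_fourier_features(X, D, sigma=1.0, seed=0):
    """Map rows of X to D random Fourier features whose inner products
    approximate the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))   # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

x = np.array([[0.2, -0.5, 1.0], [0.0, 0.3, 0.7]])
Z = random_fourier_features(x, D=20000)
exact = np.exp(-np.sum((x[0] - x[1]) ** 2) / 2.0)
print(Z[0] @ Z[1], exact)  # approximately equal
```

Because the mapping is explicit and finite-dimensional, downstream models scale linearly in the number of samples instead of quadratically as with exact kernel matrices.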


Random Forests for time-fixed and time-dependent predictors: The DynForest R package

Devaux, Anthony, Proust-Lima, Cécile, Genuer, Robin

arXiv.org Artificial Intelligence

The R package DynForest implements random forests for predicting a categorical or a (multiple-cause) time-to-event outcome based on time-fixed and time-dependent predictors. Through the random forests, the time-dependent predictors can be measured with error at subject-specific times, and they can be endogenous (i.e., impacted by the outcome process). They are modeled internally using flexible linear mixed models (via the lcmm package) with time-associations pre-specified by the user. DynForest computes dynamic predictions that take into account all the information from time-fixed and time-dependent predictors. DynForest also provides information about the most predictive variables using variable importance and minimal depth. Variable importance can also be computed on groups of variables. To display the results, several functions are available, such as the summary and plot functions. This paper aims to guide the user with a step-by-step example of the different functions for fitting random forests within DynForest.


Fair Generalized Linear Models with a Convex Penalty

Do, Hyungrok, Putzel, Preston, Martin, Axel, Smyth, Padhraic, Zhong, Judy

arXiv.org Machine Learning

Despite recent advances in algorithmic fairness, methodologies for achieving fairness with generalized linear models (GLMs) have yet to be explored in general, despite GLMs being widely used in practice. In this paper we introduce two fairness criteria for GLMs based on equalizing expected outcomes or log-likelihoods. We prove that for GLMs both criteria can be achieved via a convex penalty term based solely on the linear components of the GLM, thus permitting efficient optimization.

To address these issues there has recently been a significant body of work in the machine learning community on algorithmic fairness in the context of predictive modeling, including (i) data preprocessing methods that try to reduce disparities, (ii) in-process approaches which enforce fairness during model training, and (iii) post-process approaches which adjust a model's predictions to achieve fairness after training is completed. However, the majority of this work has focused on classification problems with binary outcome variables, and to a lesser extent on regression.
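One concrete instantiation of a convex penalty on the linear components can be sketched as follows. This is an illustration of the general idea, not the paper's exact criterion: it penalizes the squared gap in group means of the linear predictor, which is convex in the coefficients because it depends on them only through X @ beta. All names are mine.

```python
import numpy as np

def fair_logistic(X, y, group, lam, lr=0.1, steps=3000):
    """Logistic regression with a convex fairness penalty
    lam * (mean linear predictor in group 1 - mean in group 0)^2,
    fitted by plain gradient descent."""
    n, p = X.shape
    # Group mean difference of the features: the penalty is lam * (d @ beta)^2.
    d = X[group == 1].mean(axis=0) - X[group == 0].mean(axis=0)
    beta = np.zeros(p)
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
        grad = X.T @ (mu - y) / n + 2.0 * lam * (d @ beta) * d
        beta -= lr * grad
    return beta
```

With `lam = 0` this reduces to ordinary logistic regression; increasing `lam` shrinks the between-group gap in average linear predictor toward zero, trading predictive fit for parity.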


When regression coefficients change over time: A proposal

Schierholz, Malte

arXiv.org Machine Learning

A common approach in forecasting problems is to estimate a least-squares regression (or another statistical learning model) from past data, which is then applied to predict future outcomes. An underlying assumption is that the same correlations that were observed in the past still hold for the future. We propose a model for situations when this assumption is not met: adopting methods from the state-space literature, we model how regression coefficients change over time. Our approach can shed light on the large uncertainties associated with forecasting the future, and on how much of this is due to the changing dynamics of the past. Our simulation study shows that accurate estimates are obtained when the outcome is continuous, but the procedure fails for binary outcomes.
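The state-space idea can be sketched in the simplest scalar case: a regression coefficient that follows a random walk, tracked by a Kalman filter. This is a generic textbook construction under my own naming, not the author's proposed model.

```python
import numpy as np

def filter_tvp(x, y, q, r, beta0=0.0, p0=10.0):
    """Kalman filter for a scalar regression whose coefficient follows
    a random walk:
        y_t = x_t * beta_t + eps_t,   beta_t = beta_{t-1} + eta_t,
    with Var(eta_t) = q and Var(eps_t) = r."""
    beta, P = beta0, p0
    path = []
    for xt, yt in zip(x, y):
        P = P + q                            # predict: the coefficient may have drifted
        S = xt * P * xt + r                  # innovation variance
        K = P * xt / S                       # Kalman gain
        beta = beta + K * (yt - xt * beta)   # update with the new observation
        P = (1.0 - K * xt) * P
        path.append(beta)
    return np.array(path)
```

Setting `q = 0` recovers a recursively estimated constant-coefficient regression; a positive `q` lets the filter forget the past and follow a drifting coefficient.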


A generalised OMP algorithm for feature selection with application to gene expression data

Tsagris, Michail, Papadovasilakis, Zacharias, Lakiotaki, Kleanthi, Tsamardinos, Ioannis

arXiv.org Machine Learning

Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of available features. In this paper, we propose gOMP, a highly scalable generalisation of the Orthogonal Matching Pursuit feature selection algorithm in several directions: (a) different types of outcomes, such as continuous, binary, nominal, and time-to-event; (b) different types of predictive models (e.g., linear least squares, logistic regression); (c) different types of predictive features (continuous, categorical); and (d) different, statistics-based stopping criteria. We compare the proposed algorithm against LASSO, a prototypical, widely used algorithm for high-dimensional data. On dozens of simulated datasets, as well as real gene expression datasets, gOMP is on par with, or outperforms, LASSO for case-control binary classification, quantified outcomes (regression), and (censored) survival-time (time-to-event) analysis. gOMP also has several theoretical advantages that are discussed. While gOMP is based on quite simple and basic statistical ideas, and is easy to implement and to generalize, we also show in an extensive evaluation that it is quite effective in bioinformatics analysis settings.
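The base algorithm that gOMP generalises can be sketched as plain orthogonal matching pursuit for least squares. Note the simplifications relative to the paper: a fixed-size stop instead of statistical stopping criteria, and a linear model only.

```python
import numpy as np

def omp(X, y, n_features):
    """Plain orthogonal matching pursuit: repeatedly add the column most
    correlated with the current residual, then refit by least squares on
    the selected set. (gOMP generalises the outcome types, the models,
    and replaces this fixed-size stop with statistical criteria.)"""
    S = []
    residual = y.astype(float)
    for _ in range(n_features):
        corr = np.abs(X.T @ residual)
        corr[S] = -np.inf                     # exclude already-selected columns
        S.append(int(np.argmax(corr)))
        beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        residual = y - X[:, S] @ beta
    return sorted(S)
```

Refitting on the whole selected set at each step (the "orthogonal" part) keeps the residual orthogonal to every chosen column, which prevents the same feature from being picked twice and is what distinguishes OMP from plain matching pursuit.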