Feature Importance: A Closer Look at Shapley Values and LOCO
Verdinelli, Isabella, Wasserman, Larry
There is much interest lately in explainability in statistics and machine learning. One aspect of explainability is to quantify the importance of various features (or covariates). Two popular methods for defining variable importance are LOCO (Leave Out COvariates) and Shapley Values. We examine the properties of these methods and their advantages and disadvantages. We are particularly interested in the effect of correlation between features, which can obscure interpretability. Contrary to some claims, Shapley values do not eliminate feature correlation. We critique the game-theoretic axioms for Shapley values and propose new, more statistically oriented axioms for feature importance, along with some measures that satisfy them. However, correcting for correlation is a Faustian bargain: removing the effect of correlation creates other forms of bias. Ultimately, we recommend a slightly modified version of LOCO. We briefly consider how to modify Shapley values to better address feature correlation.
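To make the contrast concrete, here is a minimal sketch (the simulated design, the helper name subset_mse, and the use of in-sample MSE as the value function are our illustrative choices, not the paper's): two covariates share a latent factor, but only one enters the true regression. LOCO gives the redundant covariate near-zero importance, while its Shapley value stays positive, because Shapley averages over submodels in which the redundant covariate proxies for its correlated partner.

```python
# Minimal sketch: LOCO vs. Shapley importance under feature correlation.
# Design and helper names are illustrative, not from the paper.
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
z = rng.normal(size=n)                      # latent factor shared by X0, X1
X = np.column_stack([z + 0.3 * rng.normal(size=n),
                     z + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)  # X1 is redundant

def subset_mse(S):
    """In-sample MSE of a least-squares fit using covariate subset S."""
    if not S:
        return float(np.mean((y - y.mean()) ** 2))
    A = np.column_stack([np.ones(n), X[:, sorted(S)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ beta) ** 2))

full = frozenset(range(d))
for j in range(d):
    # LOCO: excess prediction error from dropping covariate j alone.
    loco = subset_mse(full - {j}) - subset_mse(full)
    # Shapley: weighted average marginal MSE reduction over all submodels.
    shap = 0.0
    for r in range(d):
        for S in itertools.combinations(sorted(full - {j}), r):
            w = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            shap += w * (subset_mse(set(S)) - subset_mse(set(S) | {j}))
    print(f"X{j}: LOCO = {loco:.3f}   Shapley = {shap:.3f}")
```

This is one way to see the point above: correlation does not disappear under Shapley values; it redistributes the credit.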
Decorrelated Variable Importance
Verdinelli, Isabella, Wasserman, Larry
Because of the widespread use of black box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter, known as LOCO (Leave Out COvariates), based on dropping covariates from a regression model. This is essentially a nonparametric version of R-squared. This parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.
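In the spirit of the abstract, the (unmodified) LOCO parameter can be written as a difference of prediction errors; the symbols psi_j and mu below are our illustrative notation:

```latex
% A sketch of the LOCO parameter described above; notation is ours.
\[
  \mu(x) = \mathbb{E}[Y \mid X = x], \qquad
  \mu_{-j}(x_{-j}) = \mathbb{E}[Y \mid X_{-j} = x_{-j}],
\]
\[
  \psi_j \;=\; \mathbb{E}\!\left[(Y - \mu_{-j}(X_{-j}))^2\right]
        \;-\; \mathbb{E}\!\left[(Y - \mu(X))^2\right].
\]
% Dividing psi_j by Var(Y) gives a nonparametric analogue of the drop in
% R-squared from leaving out covariate j. This also shows why psi_j is hard
% to interpret under correlation: mu_{-j} can recover much of X_j's signal
% from the remaining, correlated covariates, shrinking psi_j toward zero.
```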
Forest Guided Smoothing
Verdinelli, Isabella, Wasserman, Larry
Random forests are often an accurate method for nonparametric regression, but they are notoriously difficult to interpret, and it is difficult to construct standard errors, confidence intervals, and meaningful measures of variable importance. In this paper, we construct a spatially adaptive local linear smoother that approximates the forest. Our approach builds on the ideas in Bloniarz et al. (2016) and Friedberg et al. (2020). The main difference is that we define a one-parameter family of bandwidth matrices, which helps with the construction of confidence intervals and measures of variable importance. Our starting point is the well-known fact that a random forest can be regarded as a type of kernel smoother (Breiman (2000); Scornet (2016); Lin and Jeon (2006); Geurts et al. (2006); Hothorn et al. (2004); Meinshausen (2006)). We take it as given that the forest is an accurate predictor and make no attempt to improve the method. Instead, we want to find a family of linear smoothers that approximates the forest. We then show how to use this family for interpretation, bias correction, confidence intervals, variable importance, and exploring the structure of the forest.
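The "forest as kernel smoother" observation can be made concrete in a few lines. The sketch below is our own construction (forest_kernel_weights and local_linear_fit are hypothetical helpers, and it uses a plain scalar forest kernel rather than the paper's bandwidth-matrix family): it weights a local linear fit by how often each training point shares a leaf with the query point, and the resulting local slopes are the kind of interpretable, derivative-like quantity that supports variable importance.

```python
# Hedged sketch: a random forest viewed as a kernel smoother, then used as
# weights in a local linear fit. Illustrative only; not the paper's method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.2 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               random_state=0).fit(X, y)

def forest_kernel_weights(x0):
    """w_i = fraction of trees in which training point x_i shares a leaf with x0."""
    leaves_train = forest.apply(X)           # (n, n_trees) leaf indices
    leaves_x0 = forest.apply(x0[None, :])    # (1, n_trees)
    same = (leaves_train == leaves_x0).mean(axis=1)
    return same / same.sum()

def local_linear_fit(x0):
    """Weighted least squares of y on (1, X - x0); the intercept is the fit at x0."""
    w = forest_kernel_weights(x0)
    A = np.column_stack([np.ones(n), X - x0])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
    return beta[0], beta[1:]                 # fitted value and local slopes

x0 = np.array([0.5, -0.5])
fit, slopes = local_linear_fit(x0)
print(f"forest prediction at x0: {forest.predict(x0[None, :])[0]:.3f}")
print(f"smoother fit at x0: {fit:.3f}, local slopes: {slopes}")
```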
Finding Singular Features
Genovese, Christopher, Perone-Pacifico, Marco, Verdinelli, Isabella, Wasserman, Larry
Nonparametric ridge estimation
Genovese, Christopher R., Perone-Pacifico, Marco, Verdinelli, Isabella, Wasserman, Larry
We study the problem of estimating the ridges of a density function. Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density. It can also be used to find hidden structure in point cloud data. We show that, under mild regularity conditions, the ridges of the kernel density estimator consistently estimate the ridges of the true density. When the data are noisy measurements of a manifold, we show that the ridges are close and topologically similar to the hidden manifold. To find the estimated ridges in practice, we adapt the modified mean-shift algorithm proposed by Ozertem and Erdogmus [J. Mach. Learn. Res. 12 (2011) 1249-1286]. Some numerical experiments verify that the algorithm is accurate.
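The Ozertem and Erdogmus procedure is a subspace constrained mean shift: each point takes a mean-shift step projected onto the direction(s) in which the estimated density curves downward most sharply, so points climb onto the ridge without sliding along it. Below is a hedged toy version for a one-dimensional ridge in R^2; the noisy-circle data, bandwidth, and iteration count are our illustrative choices, not the paper's settings.

```python
# Toy subspace-constrained mean shift (SCMS) for a 1-d ridge in R^2,
# in the spirit of Ozertem and Erdogmus (2011). Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(2)
# Noisy circle: a hidden 1-d manifold observed with measurement noise.
theta = rng.uniform(0, 2 * np.pi, size=400)
data = (np.column_stack([np.cos(theta), np.sin(theta)])
        + 0.1 * rng.normal(size=(400, 2)))
h = 0.3  # Gaussian kernel bandwidth (an arbitrary illustrative choice)

def scms_step(x):
    diff = data - x                                   # (n, 2)
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2)
    shift = (w[:, None] * data).sum(0) / w.sum() - x  # mean-shift vector
    # Hessian of the Gaussian KDE at x, up to a positive constant.
    H = (w[:, None, None]
         * (diff[:, :, None] * diff[:, None, :] / h ** 2
            - np.eye(2))).sum(0)
    vals, vecs = np.linalg.eigh(H)                    # ascending eigenvalues
    V = vecs[:, [0]]                                  # most negative curvature
    return x + (V @ V.T @ shift)                      # move only across the ridge

# Push each point toward the density ridge.
ridge = data.copy()
for _ in range(50):
    ridge = np.array([scms_step(x) for x in ridge])
print("mean distance to unit circle:",
      np.mean(np.abs(np.linalg.norm(ridge, axis=1) - 1)))
```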
Minimax Manifold Estimation
Genovese, Christopher, Perone-Pacifico, Marco, Verdinelli, Isabella, Wasserman, Larry
We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in R^D given a noisy sample from the manifold. We assume that the manifold satisfies a smoothness condition and that the noise distribution has compact support. We show that the optimal rate of convergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded.
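Schematically, in our notation (consistent with the abstract), the result says:

```latex
% Schematic statement of the minimax rate; the symbols are ours.
\[
  \inf_{\hat{M}} \; \sup_{P \in \mathcal{P}}
  \mathbb{E}_P \bigl[ H(\hat{M}, M) \bigr] \;\asymp\; n^{-2/(2+d)},
\]
% where H is the Hausdorff distance, \hat{M} ranges over estimators based
% on an i.i.d. sample of size n, and \mathcal{P} is the class of
% distributions satisfying the smoothness and compact-noise conditions.
% Note that the exponent involves only d, the dimension of the manifold,
% not the ambient dimension D.
```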