Variable selection for high-dimensional linear models has received a lot of attention lately, mostly in the context of l1-regularization. Part of the attraction is the variable selection effect: parsimonious models are obtained, which are very suitable for interpretation. In terms of predictive power, however, these regularized linear models are often slightly inferior to machine learning procedures like tree ensembles. Tree ensembles, on the other hand, lack usually a formal way of variable selection and are difficult to visualize. A Garrote-style convex penalty for trees ensembles, in particular Random Forests, is proposed. The penalty selects functional groups of nodes in the trees. These could be as simple as monotone functions of individual predictor variables. This yields a parsimonious function fit, which lends itself easily to visualization and interpretation. The predictive power is maintained at least at the same level as the original tree ensemble. A key feature of the method is that, once a tree ensemble is fitted, no further tuning parameter needs to be selected. The empirical performance is demonstrated on a wide array of datasets.
We discuss the relevance of the recent Machine Learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the machine learning literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, as well as matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics, methods that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, problems that include causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.
Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals, and can show poor predictive performance in the presence of strong, smooth effects. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data.
Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles like random forests (Strobl, Malley, and Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called Gradient Boosted Regression Trees (Friedman, 2001). Our method, multivariate tree boosting, can be used for identifying important predictors, detecting predictors with non-linear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary without parametric assumptions. We provide the R package 'mvtboost' to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package 'gbm' (Ridgeway, 2013) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff and Keyes, 1995). Simulations verify that our approach identifies predictors with non-linear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions.
The problem of subgroups is ubiquitous in scientific research (ex. disease heterogeneity, spatial distributions in ecology...), and piecewise regression is one way to deal with this phenomenon. Morse-Smale regression offers a way to partition the regression function based on level sets of a defined function and that function's basins of attraction. This topologically-based piecewise regression algorithm has shown promise in its initial applications, but the current implementation in the literature has been limited to elastic net and generalized linear regression. It is possible that nonparametric methods, such as random forest or conditional inference trees, may provide better prediction and insight through modeling interaction terms and other nonlinear relationships between predictors and a given outcome. This study explores the use of several machine learning algorithms within a Morse-Smale piecewise regression framework, including boosted regression with linear baselearners, homotopy-based LASSO, conditional inference trees, random forest, and a wide neural network framework called extreme learning machines. Simulations on Tweedie regression problems with varying Tweedie parameter and dispersion suggest that many machine learning approaches to Morse-Smale piecewise regression improve the original algorithm's performance, particularly for outcomes with lower dispersion and linear or a mix of linear and nonlinear predictor relationships. On a real actuarial problem, several of these new algorithms perform as good as or better than the original Morse-Smale regression algorithm, and most provide information on the nature of predictor relationships within each partition to provide insight into differences between dataset partitions.