Stability selection for component-wise gradient boosting in multiple dimensions

Thomas, Janek, Mayr, Andreas, Bischl, Bernd, Schmid, Matthias, Smith, Adam, Hofner, Benjamin

arXiv.org Machine Learning 

Abstract We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that allows stability selection to be incorporated, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate (PFER). The model is fitted repeatedly to subsampled data, and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new "noncyclical" fitting algorithm that incorporates an additional selection step of the best-fitting distribution parameter in each iteration. This new algorithm has the additional advantage that optimizing the tuning parameters of boosting is reduced from a multidimensional to a one-dimensional problem with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply the new algorithm to a study estimating the abundance of common eider in Massachusetts, USA, featuring excess zeros, overdispersion, non-linearity and spatiotemporal structures. Stability selection is used to obtain a sparse set of stable predictors.

Keywords boosting · additive models · GAMLSS · gamboostLSS · stability selection

1 Introduction

In view of the growing size and complexity of modern databases, statistical modeling is increasingly faced with heteroscedasticity issues and a large number of available modeling options. In ecology, for example, it is often observed that outcome variables not only show differences in mean conditions but also tend to be highly variable across different geographical features or states of a combination of covariates (e.g., [33]). In addition, ecological databases typically contain large numbers of correlated predictor variables that need to be carefully chosen for possible incorporation in a statistical regression model [1,8,31].

A convenient approach to address both heteroscedasticity and variable selection in statistical regression models is the combination of GAMLSS modeling with gradient boosting algorithms. GAMLSS, short for "generalized additive models for location, scale and shape" [34], is a modeling technique that relates not only the mean but all parameters of the outcome distribution to the available covariates.
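
The subsampling idea summarized in the abstract (fit repeatedly on subsamples, keep covariates with high selection frequency) can be illustrated with a minimal sketch. It is not the noncyclical GAMLSS algorithm of this paper: as a stand-in, it uses plain component-wise L2 boosting of the mean only, on simulated data. The function names (`boost_select`, `stability_selection`, `pfer_bound`), the step length `nu`, the number of subsamples `B`, the threshold `pi_thr` and the number `q` of covariates selected per run are all assumptions made for illustration. The PFER bound shown is the one from the original stability selection proposal of Meinshausen and Bühlmann; this section does not state which bound the paper itself uses.

```python
import random


def boost_select(cols, y, q, nu=0.1, max_iter=500):
    """Component-wise L2 boosting on centered y.

    cols holds one list per covariate. Boosting stops as soon as q distinct
    covariates have been selected, and the set of their indices is returned.
    """
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]
    selected = set()
    for _ in range(max_iter):
        best_j, best_gain, best_beta = None, -1.0, 0.0
        for j, x in enumerate(cols):
            sxx = sum(v * v for v in x)
            if sxx == 0.0:
                continue
            sxr = sum(a * b for a, b in zip(x, resid))
            gain = sxr * sxr / sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_gain, best_beta = j, gain, sxr / sxx
        if best_j is None:
            break
        selected.add(best_j)
        if len(selected) >= q:
            break
        # Weak update: move only a fraction nu towards the least-squares fit.
        resid = [r - nu * best_beta * v for r, v in zip(resid, cols[best_j])]
    return selected


def stability_selection(cols, y, q, B=50, pi_thr=0.8, seed=1):
    """Selection frequencies over B random subsamples of size n // 2."""
    rng = random.Random(seed)
    n, p = len(y), len(cols)
    counts = [0] * p
    for _ in range(B):
        idx = rng.sample(range(n), n // 2)
        sub_cols = [[x[i] for i in idx] for x in cols]
        sub_y = [y[i] for i in idx]
        for j in boost_select(sub_cols, sub_y, q):
            counts[j] += 1
    freq = [c / B for c in counts]
    stable = [j for j, f in enumerate(freq) if f >= pi_thr]
    return stable, freq


def pfer_bound(q, p, pi_thr):
    """Meinshausen-Buehlmann upper bound on the PFER; needs 0.5 < pi_thr <= 1."""
    return q * q / ((2 * pi_thr - 1) * p)


# Simulated data (an assumption for illustration): only covariates 0 and 3
# influence the outcome, the other eight are pure noise.
data_rng = random.Random(0)
n, p = 200, 10
cols = [[data_rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] - 1.5 * cols[3][i] + data_rng.gauss(0, 1) for i in range(n)]

stable, freq = stability_selection(cols, y, q=3)
print("stable covariates:", stable)
print("selection frequencies:", freq)
print("PFER bound:", pfer_bound(q=3, p=p, pi_thr=0.8))
```

In this sketch the three quantities `q`, `pi_thr` and the PFER bound are tied together by a single formula, so fixing any two determines the third; choosing a larger `q` or a lower threshold loosens the error control.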
