boston housing data
Characterizing Deep Gaussian Processes via Nonlinear Recurrence Systems
Recent advances in Deep Gaussian Processes (DGPs) show the potential for more expressive representations than traditional Gaussian Processes (GPs). However, DGPs suffer from a pathology: their learning capacity decreases significantly as the number of layers increases. In this paper, we present a new analysis of DGPs that studies their corresponding nonlinear dynamic systems to explain the issue. Existing work reports the pathology for the squared exponential kernel function. We extend the investigation to four types of common stationary kernel functions. The recurrence relations between layers are derived analytically, providing a tighter bound and the rate of convergence of the dynamic systems. We demonstrate our findings with a number of experimental results.
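The kind of layer-to-layer recurrence discussed above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's derivation or its tighter bound: it assumes zero-mean layers with one-dimensional outputs and a squared exponential kernel with variance sigma^2 and lengthscale ell, for which the conditional expectation of the squared output difference d_n at two inputs is 2 sigma^2 (1 - exp(-d_{n-1} / (2 ell^2))). Because 1 - exp(-u) <= u, the expectation is bounded by (sigma^2 / ell^2)^n d_0. The function name and parameter values are hypothetical.

```python
import numpy as np

def simulate_squared_differences(d0, sigma, ell, n_layers, n_chains, seed=0):
    """Monte Carlo estimate of E[d_n], where d_n is the squared difference
    between the layer-n outputs at two fixed inputs.

    Assumption: each layer is a zero-mean GP with a squared exponential
    kernel and a one-dimensional output, so given d_{n-1} the difference of
    the two outputs is Gaussian with variance
    2 * sigma^2 * (1 - exp(-d_{n-1} / (2 * ell^2))).
    """
    rng = np.random.default_rng(seed)
    d = np.full(n_chains, float(d0))
    means = [float(d0)]
    for _ in range(n_layers):
        var = 2.0 * sigma**2 * (1.0 - np.exp(-d / (2.0 * ell**2)))
        d = var * rng.standard_normal(n_chains) ** 2
        means.append(float(d.mean()))
    return np.array(means)

# Illustrative values with sigma^2 < ell^2, the contracting regime.
sigma, ell, d0 = 1.0, 1.5, 1.0
means = simulate_squared_differences(d0, sigma, ell, n_layers=5, n_chains=200_000)

# Simple geometric bound (sigma^2 / ell^2)^n * d0 for comparison.
bound = d0 * (sigma**2 / ell**2) ** np.arange(len(means))

for n, (m, b) in enumerate(zip(means, bound)):
    print(f"layer {n}: estimated E[d_n] = {m:.4f}, geometric bound = {b:.4f}")
```

With these values the estimated expectation shrinks with each layer, so the two inputs become nearly indistinguishable at the output, which is one way to picture the loss of capacity with depth.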
Explaining black box decisions by Shapley cohort refinement
Mase, Masayoshi, Owen, Art B., Seiler, Benjamin
Black box prediction models used in statistics, machine learning and artificial intelligence have been able to make increasingly accurate predictions, but it remains hard to understand those predictions. See, for example, Štrumbelj and Kononenko (2010, 2014), Ribeiro et al. (2016), Sundararajan and Najmi (2019) and the book of Molnar (2018). Part of understanding predictions is understanding which variables are important. A variable could be important because changing it makes a causal difference, or because changing it makes a large change to our predictions or because leaving it out of a model reduces that model's prediction accuracy (Jiang and Owen, 2003). Importance by one of these criteria need not imply importance by another, though additional assumptions may allow a causal implication to be made from one of the other measures (Pearl, 2009; Zhao and Hastie, 2019). We could be interested in variables that are important overall or in variables that explain one single prediction, such as why a given person was or was not approved for a loan, or why a given patient was or was not placed in an intensive care unit.
Tree Ensembles with Rule Structured Horseshoe Regularization
Nalenz, Malte, Villani, Mattias
We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in Friedman and Popescu (2008), where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization with a horseshoe prior, which is well known to shrink noise predictors aggressively while leaving the important signals essentially untouched. This is especially important when a large number of rules are used as predictors, as many of them only contribute noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are only satisfied by a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in Friedman and Popescu (2008) with an additional set of trees from random forest, which brings a desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods like RuleFit, BART and random forest on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data, and on gene expression data for cancer classification. The posterior sampling, prediction and graphical tools for interpreting the model results are implemented in a publicly available R package.
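The shrinkage behaviour attributed to the horseshoe prior above can be pictured with a short sketch. It uses the standard horseshoe, not the rule-structured variant with the additional hierarchical layer proposed in the paper, and it assumes unit global scale and unit noise variance. Under those assumptions each coefficient has a half-Cauchy local scale lambda, and the shrinkage weight kappa = 1 / (1 + lambda^2) follows a Beta(1/2, 1/2) distribution: values near 1 correspond to almost total shrinkage, values near 0 to almost none. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Half-Cauchy(0, 1) local scales: absolute value of standard Cauchy draws.
lam = np.abs(rng.standard_cauchy(n_draws))

# Shrinkage weight, assuming global scale tau = 1 and noise variance 1.
# kappa near 1: coefficient shrunk almost to zero (treated as noise).
# kappa near 0: coefficient left almost untouched (treated as signal).
kappa = 1.0 / (1.0 + lam**2)

near_zero = np.mean(kappa < 0.1)
near_one = np.mean(kappa > 0.9)
middle = np.mean((kappa > 0.45) & (kappa < 0.55))

print(f"P(kappa < 0.1)        ~ {near_zero:.3f}")
print(f"P(0.45 < kappa < 0.55) ~ {middle:.3f}")
print(f"P(kappa > 0.9)        ~ {near_one:.3f}")
```

The prior mass piles up at both ends of the unit interval rather than in the middle, which is the sense in which the horseshoe either shrinks a predictor hard or leaves it nearly alone.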
Nonlinear Markov Networks for Continuous Variables
Hofmann, Reimar, Tresp, Volker
We address the problem of learning structure in nonlinear Markov networks with continuous variables. This can be viewed as non-Gaussian multidimensional density estimation exploiting certain conditional independencies in the variables. Markov networks are a graphical way of describing conditional independencies well suited to model relationships which do not exhibit a natural causal ordering. We use neural network structures to model the quantitative relationships between variables. The main focus in this paper will be on learning the structure for the purpose of gaining insight into the underlying process. Using two data sets we show that interesting structures can be found using our approach. Inference will be briefly addressed.
Early Brain Damage
Tresp, Volker, Neuneier, Ralph, Zimmermann, Hans-Georg
Optimal Brain Damage (OBD) is a method for reducing the number of weights in a neural network. OBD estimates the increase in the cost function if weights are pruned, and is a valid approximation if the learning algorithm has converged to a local minimum. On the other hand, it is often desirable to terminate the learning process before a local minimum is reached (early stopping). In this paper we show that OBD estimates the increase in the cost function incorrectly if the network is not in a local minimum. We also show how OBD can be extended so that it can be used in connection with early stopping. We call this new approach Early Brain Damage (EBD). EBD also allows already pruned weights to be revived. We demonstrate the improvements achieved by EBD using three publicly available data sets.
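Why the OBD estimate breaks down away from a local minimum can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not the EBD formula from the paper: it assumes a quadratic cost with a diagonal Hessian h, for which the cost change from setting weight w_i to zero is exactly -g_i w_i + 0.5 h_ii w_i^2, where g is the gradient. OBD keeps only the second-order term, which is justified when g = 0. All names and numbers are hypothetical.

```python
import numpy as np

def cost(w, w_star, h):
    # Toy quadratic cost with diagonal Hessian h and minimum at w_star.
    return 0.5 * np.sum(h * (w - w_star) ** 2)

def obd_saliency(w, h):
    # OBD estimate of the cost increase from pruning each weight;
    # it assumes the gradient is zero (training has converged).
    return 0.5 * h * w**2

def taylor_saliency(w, g, h):
    # Second-order Taylor estimate that keeps the first-order gradient term.
    return -g * w + 0.5 * h * w**2

def true_increase(w, w_star, h):
    # Actual cost change when each weight is set to zero, one at a time.
    base = cost(w, w_star, h)
    changes = []
    for i in range(len(w)):
        pruned = w.copy()
        pruned[i] = 0.0
        changes.append(cost(pruned, w_star, h) - base)
    return np.array(changes)

w_star = np.array([1.0, -2.0, 0.5])   # location of the minimum
h = np.array([2.0, 1.0, 4.0])         # diagonal Hessian entries
w = np.array([0.4, -1.0, 0.9])        # weights after early stopping
g = h * (w - w_star)                  # gradient at the stopping point

actual = true_increase(w, w_star, h)
obd = obd_saliency(w, h)
taylor = taylor_saliency(w, g, h)

print("actual change:  ", actual)
print("OBD estimate:   ", obd)
print("with gradient:  ", taylor)
```

On this toy cost the estimate that keeps the gradient term matches the actual change, while the OBD estimate does not, and the two coincide only when the weights sit at the minimum.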