Matsubara, Takuo
Wasserstein Gradient Boosting: A General Framework with Applications to Posterior Regression
Matsubara, Takuo
Gradient boosting is a sequential ensemble method that fits a new base learner to the gradient of the remaining loss at each step. We propose a novel family of gradient boosting methods, Wasserstein gradient boosting, which fits a new base learner to an exactly or approximately available Wasserstein gradient of a loss functional on the space of probability distributions. Wasserstein gradient boosting returns a set of particles that approximates a target probability distribution assigned to each input. In probabilistic prediction, a parametric probability distribution is often specified on the space of output variables, and a point estimate of the output-distribution parameter is produced for each input by a model. Our main application of Wasserstein gradient boosting is a novel distributional estimate of the output-distribution parameter, which approximates the posterior distribution over the output-distribution parameter determined pointwise at each data point. We empirically demonstrate the superior performance of probabilistic prediction by Wasserstein gradient boosting in comparison with various existing methods.
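To make the mechanism concrete, here is a heavily simplified toy sketch of the boosting loop described above: particles attached to each input are moved by base learners fitted to an approximate Wasserstein gradient of a KL-type loss towards a per-input Gaussian target. The kernel-density score estimate, the per-particle regression trees, and all tuning constants are assumptions made for illustration, not the construction used in the paper.

```python
# Toy sketch (not the paper's algorithm): per-input particles are moved by regression trees
# fitted to an approximate Wasserstein gradient of KL(current || target), where each input x
# has a Gaussian "target posterior" N(sin(x), 0.5^2) over a scalar parameter.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 50).reshape(-1, 1)
target_mean, target_std = np.sin(X).ravel(), 0.5

n_particles, learning_rate = 20, 0.1
particles = rng.normal(0.0, 1.0, size=(X.shape[0], n_particles))  # initial particles per input

def target_score(z):
    """Score (gradient of the log density) of the per-input target N(target_mean, target_std^2)."""
    return -(z - target_mean[:, None]) / target_std**2

def current_score(z, bandwidth=0.3):
    """Crude kernel-density estimate of the score of the current particle distribution per input."""
    diffs = z[:, :, None] - z[:, None, :]                 # shape (n_inputs, P, P)
    w = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    return (-(diffs / bandwidth**2) * w).sum(-1) / w.sum(-1)

ensemble = []
for _ in range(100):
    # Wasserstein gradient of KL(q || p) evaluated at each particle: score_q - score_p.
    grad = current_score(particles) - target_score(particles)
    # Fit one small tree per particle index to the negative gradient, as in gradient boosting.
    round_learners = []
    for p in range(n_particles):
        tree = DecisionTreeRegressor(max_depth=3).fit(X, -grad[:, p])
        particles[:, p] += learning_rate * tree.predict(X)
        round_learners.append(tree)
    ensemble.append(round_learners)

print("first input: particle mean %.2f vs target mean %.2f" % (particles[0].mean(), target_mean[0]))
```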
Generalised Bayesian Inference for Discrete Intractable Likelihood
Matsubara, Takuo, Knoblauch, Jeremias, Briol, François-Xavier, Oates, Chris J.
Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled from using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost.
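As a rough illustration of the recipe, not the paper's exact divergence or models, the sketch below builds a generalised posterior for a Poisson-type count model from a discrete-score-matching-style loss that uses only ratios of the unnormalised pmf, so the normalising constant never appears, and samples it with random-walk Metropolis. The specific loss form, the Exp(1) prior, and the weight `beta` are assumptions for this toy.

```python
# A minimal sketch, assuming a Poisson-type count model with unnormalised pmf p(x) ∝ theta^x / x!.
# The loss below uses only the pmf ratios x/theta and (x+1)/theta, so no normalising constant
# is needed; it is a stand-in for the paper's discrete Fisher divergence, not its exact form.
import numpy as np

rng = np.random.default_rng(1)
data = rng.poisson(lam=4.0, size=200)           # toy count data

def ratio_loss(theta, x):
    """Average of (p(x-1)/p(x))^2 - 2 * p(x)/p(x+1); for this model the population minimiser
    is the true rate, so the generalised posterior concentrates around it."""
    return np.mean((x / theta) ** 2 - 2.0 * (x + 1.0) / theta)

beta, n = 1.0, len(data)

def log_gen_posterior(theta):
    if theta <= 0:
        return -np.inf
    log_prior = -theta                          # Exp(1) prior on theta > 0 (an arbitrary choice)
    return log_prior - beta * n * ratio_loss(theta, data)

# Random-walk Metropolis on the generalised posterior: no intractable constant is ever evaluated.
theta, samples = 1.0, []
for _ in range(5000):
    proposal = theta + 0.2 * rng.standard_normal()
    if np.log(rng.uniform()) < log_gen_posterior(proposal) - log_gen_posterior(theta):
        theta = proposal
    samples.append(theta)

print("generalised-posterior mean:", np.mean(samples[1000:]))
```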
TCE: A Test-Based Approach to Measuring Calibration Error
Matsubara, Takuo, Tax, Niek, Mudd, Richard, Guy, Ido
This paper proposes a new metric to measure the calibration error of probabilistic binary classifiers, called test-based calibration error (TCE). TCE incorporates a novel loss function based on a statistical test to examine the extent to which model predictions differ from probabilities estimated from data. It offers (i) a clear interpretation, (ii) a consistent scale that is unaffected by class imbalance, and (iii) an enhanced visual representation with respect to the standard reliability diagram. In addition, we introduce an optimality criterion for the binning procedure of calibration error metrics.
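The following is a hypothetical sketch of the "test-based" idea rather than the paper's exact TCE: predictions are grouped into equal-mass bins, a binomial test per bin asks whether the observed label frequency is consistent with the mean predicted probability, and the reported score is the share of predictions falling into bins where the test rejects. The function name, binning rule, and significance level are all illustrative assumptions.

```python
# Hypothetical test-based calibration check (not the paper's exact TCE definition).
import numpy as np
from scipy.stats import binomtest

def test_based_calibration_score(probs, labels, n_bins=10, alpha=0.05):
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    edges = np.quantile(probs, np.linspace(0, 1, n_bins + 1))        # equal-mass bins
    bin_ids = np.clip(np.searchsorted(edges, probs, side="right") - 1, 0, n_bins - 1)
    rejected = 0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        k, n = int(labels[mask].sum()), int(mask.sum())
        p = float(np.clip(probs[mask].mean(), 1e-6, 1 - 1e-6))
        # Binomial test: do the observed positives in this bin contradict the mean prediction?
        if binomtest(k, n, p).pvalue < alpha:
            rejected += mask.sum()
    return rejected / len(probs)      # share of predictions in bins flagged as miscalibrated

# Toy usage: an over-confident predictor on imbalanced data.
rng = np.random.default_rng(0)
true_p = rng.beta(1, 5, size=5000)
labels = rng.binomial(1, true_p)
overconfident = np.clip(true_p ** 0.5, 0, 1)
print(test_based_calibration_score(overconfident, labels))
```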
Robust Generalised Bayesian Inference for Intractable Likelihoods
Matsubara, Takuo, Knoblauch, Jeremias, Briol, François-Xavier, Oates, Chris J.
Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.
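To show the shape of such a loss-based update, here is a minimal one-dimensional sketch of a generalised posterior built from a kernel Stein discrepancy. The Gaussian location model is used only because its score function is simple (the paper's setting targets unnormalised models), and the kernel, bandwidth, prior, and weight `beta` are illustrative choices.

```python
# 1-D sketch of a kernel-Stein-discrepancy-based generalised posterior (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.5, 1.0, size=100)          # observations (outliers could be mixed in here)

def ksd_squared(theta, x, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy between N(theta, 1)
    and the empirical distribution of x, using a Gaussian kernel of bandwidth h."""
    d = x[:, None] - x[None, :]
    k = np.exp(-0.5 * d**2 / h**2)
    dkx = -d / h**2 * k                      # derivative of k in its first argument
    dky = d / h**2 * k                       # derivative of k in its second argument
    dkxy = (1.0 / h**2 - d**2 / h**4) * k    # mixed second derivative
    s = -(x - theta)                         # score of N(theta, 1)
    u = s[:, None] * s[None, :] * k + s[:, None] * dky + s[None, :] * dkx + dkxy
    return u.mean()

# Generalised posterior on a grid: N(0, 10^2) prior, weight beta (a tuning choice).
thetas = np.linspace(-2, 5, 400)
beta = 1.0
log_post = -0.5 * thetas**2 / 100 - beta * len(x) * np.array([ksd_squared(t, x) for t in thetas])
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, thetas)
print("generalised-posterior mean:", np.trapz(thetas * post, thetas))
```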
The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks
Matsubara, Takuo, Oates, Chris J., Briol, François-Xavier
Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited covariance structure in the output space of the network. The approach is rooted in the ridgelet transform and we establish both finite-sample-size error bounds and the consistency of the approximation of the covariance function in a limit where the number of hidden units is increased. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can outperform an unstructured prior on regression problems for which an informative covariance function can be provided a priori.
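Not the ridgelet construction itself, but a small sketch of the gap it is designed to close: function-space draws from a one-hidden-layer network with an unstructured standard normal prior on its parameters, compared with draws from a Gaussian process carrying the squared-exponential covariance a user might actually wish to encode. The network width, activation, and length-scale are arbitrary illustrative choices.

```python
# Illustration of the mismatch between a weight-space prior and a desired function-space covariance.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)

def bnn_prior_draw(width=500):
    """One function draw from a width-500 tanh network with i.i.d. standard normal parameters."""
    w1, b1 = rng.normal(size=(width, 1)), rng.normal(size=width)
    w2, b2 = rng.normal(size=width) / np.sqrt(width), rng.normal()
    return np.tanh(x[:, None] * w1.T + b1) @ w2 + b2

def gp_prior_draw(lengthscale=1.0):
    """One draw from a GP with squared-exponential covariance (unit variance)."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / lengthscale**2) + 1e-8 * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.normal(size=len(x))

bnn_samples = np.stack([bnn_prior_draw() for _ in range(100)])
gp_samples = np.stack([gp_prior_draw() for _ in range(100)])

# Compare empirical prior covariances at a pair of inputs: the unstructured weight prior does
# not reproduce the covariance structure the GP encodes, which is the gap the ridgelet prior targets.
i, j = 50, 150
print("BNN prior cov:", np.cov(bnn_samples[:, i], bnn_samples[:, j])[0, 1])
print("GP prior cov: ", np.cov(gp_samples[:, i], gp_samples[:, j])[0, 1])
```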
Integral representation of the global minimizer
Sonoda, Sho, Ishikawa, Isao, Ikeda, Masahiro, Hagihara, Kei, Sawano, Yoshihiro, Matsubara, Takuo, Murata, Noboru
We have obtained an integral representation of the shallow neural network that attains the global minimum of its backpropagation (BP) training problem. In unpublished numerical simulations conducted several years before this study, we had noticed that such an integral representation might exist, but it had not been proven until now. First, we introduced a Hilbert space of coefficient functions, and a reproducing kernel Hilbert space (RKHS) of hypotheses, associated with the integral representation. The RKHS reflects the approximation ability of neural networks. Second, we established ridgelet analysis on the RKHS; the analytic properties of the integral representation are remarkably clear. Third, we reformulated BP training as an optimization problem in the space of coefficient functions and obtained a formal expression for the unique global minimizer, following Tikhonov regularization theory. Finally, we demonstrated that the global minimizer is the shrink ridgelet transform. Since the relation between an integral representation and an ordinary finite network is not clear, even though BP is convex in the integral representation, we cannot immediately answer questions such as "Is a local minimum a global minimum?" However, the obtained integral representation provides an explicit expression of the global minimizer, without linearity-like assumptions such as partial linearity and monotonicity. Furthermore, it indicates that the ordinary ridgelet transform provides the minimum-norm solution to the original training equation.
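For reference, the generic form of the integral representation and the ridgelet transform referred to above can be written as follows; these are standard conventions, which may differ in detail from those adopted in the paper.

```latex
% Integral representation of a shallow network with activation \eta and coefficient function T:
\[
  S[T](x) \;=\; \int_{\mathbb{R}^{m}\times\mathbb{R}} T(a,b)\, \eta(a\cdot x - b)\, \mathrm{d}a\,\mathrm{d}b .
\]
% Ridgelet transform of a function f with respect to a ridgelet function \psi:
\[
  R[f](a,b) \;=\; \int_{\mathbb{R}^{m}} f(x)\, \overline{\psi(a\cdot x - b)}\, \mathrm{d}x .
\]
% When the pair (\psi, \eta) satisfies an admissibility condition, reconstruction holds up to a
% constant, S[R[f]] = f, so R[f] is a coefficient function that represents f exactly; the abstract's
% claim is that the BP global minimizer is a shrunk version of this transform.
```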