Detommaso, Gianluca, Cui, Tiangang, Marzouk, Youssef, Spantini, Alessio, Scheichl, Robert

Stein variational gradient descent (SVGD) was recently proposed as a general purpose nonparametric variational inference algorithm: it minimizes the Kullback–Leibler divergence between the target distribution and its approximation by implementing a form of functional gradient descent on a reproducing kernel Hilbert space [Liu & Wang, NIPS 2016]. In this paper, we accelerate and generalize the SVGD algorithm by including second-order information, thereby approximating a Newton-like iteration in function space. We also show how second-order information can lead to more effective choices of kernel. We observe significant computational gains over the original SVGD algorithm in multiple test cases.
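The Newton-like idea can be illustrated with a deliberately simplified sketch (hypothetical toy code, not the paper's Galerkin solve): precondition the first-order SVGD direction with the averaged Hessian of the negative log-target. For a poorly scaled 1-D Gaussian target N(0, 9), that Hessian is the constant 1/9, and the scaling lets a modest step size make rapid progress.

```python
import numpy as np

# Simplified sketch (not the paper's exact algorithm): scale the first-order
# SVGD direction by the inverse of the averaged Hessian of -log p.

def newton_like_step(X, grad_log_p, H, eps=0.2, h=1.0):
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    dK = -diff / h ** 2 * K[:, :, None]               # grad_{x_j} k(x_j, x_i)
    phi = (K @ grad_log_p(X) + dK.sum(axis=0)) / n    # first-order SVGD direction
    return X + eps * np.linalg.solve(H, phi.T).T      # Hessian-preconditioned step

rng = np.random.default_rng(0)
X = rng.normal(10.0, 1.0, size=(50, 1))   # badly initialized particles
H = np.array([[1.0 / 9.0]])               # -d^2/dx^2 log N(0, 9), constant here
for _ in range(500):
    X = newton_like_step(X, lambda X: -X / 9.0, H)
```

With the same step size but no Hessian scaling, the drift toward the mode would be roughly nine times smaller per iteration, which is the kind of computational gain the abstract refers to.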

Zhuo, Jingwei, Liu, Chang, Chen, Ning, Zhang, Bo

Stein variational gradient descent (SVGD) is a nonparametric inference method, which iteratively transports a set of randomly initialized particles to approximate a differentiable target distribution, along the direction that maximally decreases the KL divergence within a vector-valued reproducing kernel Hilbert space (RKHS). Compared to Monte Carlo methods, SVGD is particle-efficient because of the repulsive force induced by kernels. In this paper, we develop the first analysis of the high-dimensional performance of SVGD and demonstrate that the repulsive force drops at least polynomially with increasing dimension, which results in poor marginal approximation. To improve the marginal inference of SVGD, we propose Marginal SVGD (M-SVGD), which incorporates structural information described by a Markov random field (MRF) into kernels. M-SVGD inherits the particle efficiency of SVGD and can be used as a general purpose marginal inference tool for MRFs. Experimental results on grid-based Markov random fields show the effectiveness of our methods.
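The claimed decay of the repulsive force can be checked with a quick numerical experiment (a toy setup, not the paper's analysis): draw particles from N(0, I_d), form the SVGD repulsion term for an RBF kernel with a median-heuristic bandwidth, and watch its average norm shrink as the dimension grows.

```python
import numpy as np

def mean_repulsion(d, n=200, seed=0):
    # Average norm of the SVGD repulsion term (1/n) sum_j grad_{x_j} k(x_j, x_i)
    # for k(x, y) = exp(-||x - y||^2 / h2) with a common median-heuristic
    # bandwidth h2 = median(||x - y||^2) / log n; particles drawn from N(0, I_d).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    diff = X[:, None, :] - X[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    h2 = np.median(sq) / np.log(n)
    K = np.exp(-sq / h2)
    rep = (-2.0 * diff / h2 * K[:, :, None]).sum(axis=0) / n
    return np.linalg.norm(rep, axis=1).mean()

for d in (1, 10, 100):
    print(d, mean_repulsion(d))   # norm shrinks as dimension grows
```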

Bamler, Robert, Zhang, Cheng, Opper, Manfred, Mandt, Stephan

Black box variational inference (BBVI) with reparameterization gradients triggered the exploration of divergence measures other than the Kullback-Leibler (KL) divergence, such as alpha divergences. In this paper, we view BBVI with generalized divergences as a form of estimating the marginal likelihood via biased importance sampling. The choice of divergence determines a bias-variance trade-off between the tightness of a bound on the marginal likelihood (low bias) and the variance of its gradient estimators. Drawing on variational perturbation theory from statistical physics, we use these insights to construct a family of new variational bounds. Enumerated by an odd integer order $K$, this family captures the standard KL bound for $K=1$, and converges to the exact marginal likelihood as $K\to\infty$. Compared to alpha-divergences, our reparameterization gradients have a lower variance. We show in experiments on Gaussian Processes and Variational Autoencoders that the new bounds are more mass-covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.
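A hypothetical toy sketch (not the paper's experiments) of such a bound family: with $V = \log p(x,z) - \log q(z)$ and reference point $V_0 = \mathbb{E}_q[V]$, one can form $L_K = \mathbb{E}_q[e^{V_0} \sum_{k=0}^{K} (V-V_0)^k/k!]$. Since odd-order Taylor truncations of the exponential are global lower bounds, $L_K \le p(x)$ for odd $K$; $K=1$ recovers $\exp(\mathrm{ELBO})$, and $L_K \to p(x)$ as $K \to \infty$. A conjugate Gaussian model, where $p(x)$ is known in closed form, makes the ordering visible.

```python
import math
import numpy as np

# Toy illustration of an odd-order perturbative bound family on the marginal
# likelihood (model, q, and reference point chosen here for convenience).

rng = np.random.default_rng(1)
N = 200_000
x_obs = 2.0

def log_normal(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (y - mean) ** 2 / (2 * var)

# Conjugate model: z ~ N(0, 1), x | z ~ N(z, 1), hence p(x) = N(x; 0, 2).
# Deliberately mismatched approximation q(z) = N(1, 0.8); true posterior is N(1, 0.5).
mu_q, var_q = 1.0, 0.8
z = rng.normal(mu_q, math.sqrt(var_q), size=N)
V = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0) - log_normal(z, mu_q, var_q)
dV = V - V.mean()   # V0 estimated as the sample mean of V

def bound(K):
    # L_K = E_q[exp(V0) * sum_{k=0}^{K} (V - V0)^k / k!]
    taylor = sum(dV ** k / math.factorial(k) for k in range(K + 1))
    return float(np.exp(V.mean()) * taylor.mean())

p_true = math.exp(log_normal(x_obs, 0.0, 2.0))
print(bound(1), bound(3), p_true)   # odd-order bounds tighten toward p(x)
```

Here `bound(1)` is exactly the exponentiated Monte Carlo ELBO, while `bound(3)` trades a little extra estimator variance for a tighter (less biased) estimate of $p(x)$, mirroring the bias-variance trade-off described above.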

Liu, Qiang, Wang, Dilin

We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the target distribution, by applying a form of functional gradient descent that minimizes the KL divergence. Empirical studies are performed on various real-world models and datasets, on which our method is competitive with existing state-of-the-art methods. The derivation of our method is based on a new theoretical result that connects the derivative of KL divergence under smooth transforms with Stein's identity and a recently proposed kernelized Stein discrepancy, which is of independent interest.
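The particle transport described above follows the update $\phi(x_i) = \frac{1}{n}\sum_j [\,k(x_j, x_i)\,\nabla \log p(x_j) + \nabla_{x_j} k(x_j, x_i)\,]$, whose first term drives particles toward high-probability regions and whose second term is the repulsive force that keeps them spread out. A minimal sketch for a 1-D standard-normal target (toy code, using a fixed RBF bandwidth rather than the usual median heuristic):

```python
import numpy as np

def svgd_step(X, grad_log_p, eps=0.05, h=1.0):
    # One SVGD update with an RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    dK = -diff / h ** 2 * K[:, :, None]              # grad_{x_j} k(x_j, x_i)
    phi = (K @ grad_log_p(X) + dK.sum(axis=0)) / n   # driving + repulsive terms
    return X + eps * phi

# Toy target: standard normal, so grad log p(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(5.0, 0.5, size=(100, 1))   # badly initialized particles
for _ in range(500):
    X = svgd_step(X, lambda X: -X)
print(X.mean(), X.std())   # should approach 0 and 1
```

The repulsive term `dK.sum(axis=0)` is what distinguishes SVGD from plain gradient ascent on $\log p$: without it, all particles would collapse onto the mode instead of approximating the full distribution.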