 true posterior


Stepwise Variational Inference with Vine Copulas

arXiv.org Machine Learning

We propose stepwise variational inference (VI) with vine copulas: a universal VI procedure that combines vine copulas with a novel stepwise estimation procedure of the variational parameters. Vine copulas consist of a nested sequence of trees built from copulas, where more complex latent dependence can be modeled with increasing number of trees. We propose to estimate the vine copula approximate posterior in a stepwise fashion, tree by tree along the vine structure. Further, we show that the usual backward Kullback-Leibler divergence cannot recover the correct parameters in the vine copula model, thus the evidence lower bound is defined based on the Rényi divergence. Finally, an intuitive stopping criterion for adding further trees to the vine eliminates the need to pre-define a complexity parameter of the variational distribution, as required for most other approaches. Thus, our method interpolates between mean-field VI (MFVI) and full latent dependence. In many applications, in particular sparse Gaussian processes, our method is parsimonious with parameters, while outperforming MFVI.
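As background for the divergence mentioned in the abstract, the following is a minimal sketch of the standard Rényi divergence of order α between two discrete distributions, compared with the Kullback-Leibler divergence it approaches as α tends to 1. It is an illustration only, not the paper's estimator or its evidence lower bound; the function names `kl_divergence` and `renyi_divergence` and the example probability vectors are assumptions made for this sketch.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(p || q) = log(sum_i p_i^alpha q_i^(1-alpha)) / (alpha - 1).

    Defined here for alpha > 0 and alpha != 1; assumes q[i] > 0 wherever p[i] > 0.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(
        (pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q) if pi > 0
    )
    return math.log(total) / (alpha - 1.0)


# Hypothetical example distributions, chosen only for illustration.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

for alpha in (0.5, 0.999, 2.0):
    print(f"alpha={alpha}: D_alpha={renyi_divergence(p, q, alpha):.6f}")
print(f"KL={kl_divergence(p, q):.6f}")
```

The order α acts as a tuning knob: the divergence is non-decreasing in α, and values close to 1 give results close to the KL divergence.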




e3844e186e6eb8736e9f53c0c5889527-Paper.pdf

Neural Information Processing Systems

Inference networks of traditional Variational Autoencoders (VAEs) are typically amortized, resulting in relatively inaccurate posterior approximation compared to instance-wise variational optimization. Recent semi-amortized approaches were proposed to address this drawback; however, their iterative gradient update procedures can be computationally demanding.



Comments on the main proof strategy

Neural Information Processing Systems

We thank the reviewer for the insightful comments on the proof. In the main text, we will clarify notions such as "overparameterise" or "fully trained". We further evaluate the robustness of deep ensembles on a subset of the NNs employed in Section 5.3. Table 1: FGSM and PGD attacks on the network employed in Section 5.2. For deterministic NNs, Theorem 1 does not hold.



T. (21) From the above equation, ker h=span h 0d0 n, Φ(2)

Neural Information Processing Systems

The last equation is derived as follows. In addition, we set the observation variance σx to 0.25. Logistic(·; µ, s) is the density function of a logistic distribution with the location parameter µ and the scale parameter s, and σ is the logistic sigmoid function. Before each activation, we apply the layer normalization [Ba et al., 2016] to stabilize training. When the model has sufficiently high expressive power, b may diverge to infinity [Rezende and Viola, 2018], so we add a regularization term of (b+2ζ( b))/m to the loss function, where m is the number of training examples.
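To make the relation between the logistic density and the logistic sigmoid concrete, here is a minimal sketch using the standard textbook identities: with z = (x − µ)/s, the cumulative distribution function is σ(z) and the density is σ(z)(1 − σ(z))/s. The function names `sigmoid`, `logistic_pdf`, and `logistic_cdf` are assumptions for this illustration and are not taken from the paper's code.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_cdf(x, mu, s):
    """CDF of a logistic distribution with location mu and scale s > 0."""
    return sigmoid((x - mu) / s)


def logistic_pdf(x, mu, s):
    """Density of a logistic distribution with location mu and scale s > 0."""
    p = sigmoid((x - mu) / s)
    return p * (1.0 - p) / s


# The density peaks at the location parameter, where it equals 1 / (4 s).
print(logistic_pdf(0.0, 0.0, 1.0))
print(logistic_cdf(0.0, 0.0, 1.0))
```

Writing the density through the sigmoid keeps the computation stable in the tails, where evaluating exponentials directly could overflow.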


176a579942089c4cdc70136c567932ab-Paper-Conference.pdf

Neural Information Processing Systems

We consider here the sparse Gaussian process regression (SGPR) approach introduced by Titsias [31], which is widely used in practice (see [1, 9] for implementations) and has been studied in many recent works [13,21,5,6,38,28,32,22,23].