Bird, Thomas
Variational f-divergence Minimization
Zhang, Mingtian, Bird, Thomas, Habib, Raza, Xu, Tianlin, Barber, David
Probabilistic models are often trained by maximum likelihood, which corresponds to minimizing a specific f-divergence between the model and data distribution. In light of recent successes in training Generative Adversarial Networks, alternative non-likelihood training criteria have been proposed. Whilst not necessarily statistically efficient, these alternatives may better match user requirements such as sharp image generation. A general variational method for training probabilistic latent variable models using maximum likelihood is well established; however, how to train latent variable models using other f-divergences is comparatively unknown. We discuss a variational approach that, when combined with the recently introduced Spread Divergence, can be applied to train a large class of latent variable models using any f-divergence.
Spread Divergences
Barber, David, Zhang, Mingtian, Habib, Raza, Bird, Thomas
For distributions p and q with different support, the divergence D(p q) generally will not exist. We define a spread divergence D(p q) on modified p and q and describe sufficient conditions for the existence of such a divergence. We give examples of using a spread divergence to train implicit generative models, including linearmodels (Principal Components Analysis and Independent Components Analysis) and nonlinear models (Deep Generative Networks).
Stochastic Variational Optimization
Bird, Thomas, Kunze, Julius, Barber, David
Variational Optimization forms a differentiable upper bound on an objective. We show that approaches such as Natural Evolution Strategies and Gaussian Perturbation, are special cases of Variational Optimization in which the expectations are approximated by Gaussian sampling. These approaches are of particular interest because they are parallelizable. We calculate the approximate bias and variance of the corresponding gradient estimators and demonstrate that using antithetic sampling or a baseline is crucial to mitigate their problems. We contrast these methods with an alternative parallelizable method, namely Directional Derivatives. We conclude that, for differentiable objectives, using Directional Derivatives is preferable to using Variational Optimization to perform parallel Stochastic Gradient Descent.