Probabilistic Graphical Modeling and Variational Inference play an important role in recent advances in Deep Reinforcement Learning. Aiming at a self-consistent tutorial survey, this article illustrates the basic concepts of reinforcement learning with Probabilistic Graphical Models and recaps the derivation of some basic formulas. It reviews and compares recent advances in deep reinforcement learning across different research directions and from various aspects. We provide Probabilistic Graphical Models, together with detailed explanations and derivations, for several use cases of Variational Inference; these serve as complementary material to the original contributions.
Applying probabilistic models to reinforcement learning (RL) enables the use of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the lack of mode-capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in approaches based on maximum entropy RL, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. In applying variational expectation-maximisation to VIREL, we thus show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step.
In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function from samples lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs (directed acyclic graphs that include both deterministic functions and conditional probability distributions) and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in a variety of prior work, along with the variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.
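As a minimal illustration of the kind of estimator such a formalism covers, the Python sketch below compares two standard single-node gradient estimators, the score-function (likelihood-ratio) estimator and the pathwise (reparameterization) estimator, on the toy objective E[x^2] with x ~ N(theta, 1), whose exact gradient is 2*theta. The toy objective, the function names, and the sample sizes are assumptions chosen for illustration; this is not the general graph-based algorithm described in the abstract.

```python
import numpy as np


def score_function_grad(theta, n_samples, rng):
    """Score-function estimate of d/dtheta E_{x ~ N(theta, 1)}[x^2]."""
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    cost = x ** 2
    # d/dtheta log N(x; theta, 1) = (x - theta)
    score = x - theta
    return np.mean(cost * score)


def pathwise_grad(theta, n_samples, rng):
    """Pathwise estimate using the reparameterization x = theta + eps."""
    eps = rng.standard_normal(n_samples)
    x = theta + eps
    # d/dtheta (theta + eps)^2 = 2 * (theta + eps)
    return np.mean(2.0 * x)


rng = np.random.default_rng(0)
theta = 1.5
exact = 2.0 * theta  # since E[x^2] = theta^2 + 1
sf = score_function_grad(theta, 200_000, rng)
pw = pathwise_grad(theta, 200_000, rng)
print(f"exact={exact:.3f} score-function={sf:.3f} pathwise={pw:.3f}")
```

Both estimators are unbiased here, but the score-function estimate typically has much higher variance, which is why variance-reduction techniques matter in practice.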
State space models (SSMs) have been widely applied for the analysis and visualization of large sequential datasets. Sequential Monte Carlo (SMC) is a very popular particle-based method for sampling latent states from intractable posteriors. However, SMC is significantly influenced by the choice of the proposal. Recently, Hamiltonian Monte Carlo (HMC) sampling has shown success in many practical problems. In this paper, we propose an SMC augmented by HMC (HSMC) for inference and model learning of nonlinear SSMs, which exempts us from learning proposals and reduces the model complexity significantly. Based on the measure-preserving property of HMC, the particles directly generated by the transition function can approximate the posterior of the latent states arbitrarily well. In order to better adapt to the local geometry of the latent space, the HMC is conducted on a Riemannian manifold defined by a positive definite metric. In addition, we show that the proposed HSMC method can improve SSMs realized by both Gaussian Processes (GPs) and Neural Networks (NNs).
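To make the role of the proposal concrete, the following Python sketch implements a plain bootstrap particle filter, in which particles are proposed directly from the transition model and then reweighted by the observation likelihood. It is not the HSMC method of the abstract (there is no HMC step), and the toy model x_t = sin(x_{t-1}) + noise, y_t = x_t + noise, along with all names and noise scales, is an assumption made for illustration.

```python
import numpy as np


def bootstrap_particle_filter(ys, n_particles, rng, trans_std=0.5, obs_std=0.5):
    """Bootstrap SMC for the toy model x_t = sin(x_{t-1}) + noise, y_t = x_t + noise."""
    particles = rng.standard_normal(n_particles)  # assumed prior x_0 ~ N(0, 1)
    means = np.empty(len(ys))
    log_lik = 0.0
    for t, y in enumerate(ys):
        # Proposal = transition model (the "bootstrap" choice).
        particles = np.sin(particles) + trans_std * rng.standard_normal(n_particles)
        # Log-weights from the Gaussian observation likelihood N(y; x, obs_std^2).
        log_w = (-0.5 * ((y - particles) / obs_std) ** 2
                 - np.log(obs_std * np.sqrt(2.0 * np.pi)))
        max_lw = log_w.max()
        w = np.exp(log_w - max_lw)  # subtract the max for numerical stability
        log_lik += max_lw + np.log(w.mean())  # running log marginal likelihood estimate
        w /= w.sum()
        means[t] = np.sum(w * particles)  # filtered posterior mean
        # Multinomial resampling to counter weight degeneracy.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return means, log_lik


# Simulate data from the same toy model, then run the filter.
rng = np.random.default_rng(0)
T = 50
xs, ys = np.empty(T), np.empty(T)
x = rng.standard_normal()
for t in range(T):
    x = np.sin(x) + 0.5 * rng.standard_normal()
    xs[t] = x
    ys[t] = x + 0.5 * rng.standard_normal()

means, log_lik = bootstrap_particle_filter(ys, 1000, rng)
rmse = np.sqrt(np.mean((means - xs) ** 2))
print(f"filter RMSE={rmse:.3f}, log-likelihood estimate={log_lik:.2f}")
```

Because the proposal ignores the current observation, many particles can land in low-likelihood regions when observations are informative; that sensitivity is what learned or HMC-refined proposals aim to reduce.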
Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned. We demonstrate the drawbacks of biasing the true posterior to be unimodal, and introduce Annealed Variational Objectives (AVO) into the training of hierarchical variational methods. Inspired by Annealed Importance Sampling, the proposed method facilitates learning by incorporating energy tempering into the optimization objective. In our experiments, we demonstrate our method's robustness to deterministic warm up, and the benefits of encouraging exploration in the latent space.
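For context, Annealed Importance Sampling commonly bridges an easy initial density and the unnormalized target through a geometric sequence of intermediate densities. The block below writes out that standard construction; the symbols and the schedule notation are assumptions introduced here, and whether AVO uses exactly this interpolation is not stated in the abstract.

```latex
% \tilde{f}_0: tractable initial density, \tilde{f}_T: unnormalized target density
% \beta_t: annealing (inverse temperature) schedule, assumed monotonically increasing
\tilde{f}_t(z) = \tilde{f}_0(z)^{\,1-\beta_t}\,\tilde{f}_T(z)^{\,\beta_t},
\qquad 0 = \beta_0 < \beta_1 < \dots < \beta_T = 1 .
```

At small values of the schedule the intermediate density stays close to the broad initial density, which is the sense in which tempering can encourage exploration before the target's modes dominate.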