Marta Kwiatkowska is a co-proposer of the Validate AI Conference. She is Professor of Computing Systems and Fellow of Trinity College, University of Oxford. Prior to this, she was Professor in the School of Computer Science at the University of Birmingham, Lecturer at the University of Leicester, and Assistant Professor at the Jagiellonian University in Cracow, Poland. Kwiatkowska has made fundamental contributions to the theory and practice of model checking for probabilistic systems, focusing on automated techniques for verification and synthesis from quantitative specifications. More recently, she has been working on safety and robustness verification for neural networks with provable guarantees.
This blog post is mainly inspired by the work I did on Bayesian Neural Networks with my friend Brian Trippe at the Computational and Biological Learning Lab at Cambridge University. I highly recommend reading Brian's thesis on variational inference in neural networks. Disclaimer: at the Computational and Biological Learning Lab, Bayesian machine learning techniques are unapologetically taught as the way forward, so be aware of potential bias in this blog post.

For example, in image classification, x represents an image and y the corresponding image label.
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

Luisa Zintgraf (University of Oxford), Kyriacos Shiarlis (Latent Logic), Maximilian Igl (University of Oxford), Sebastian Schulze (University of Oxford), Yarin Gal (OATML Group, University of Oxford), Katja Hofmann (Microsoft Research), Shimon Whiteson (University of Oxford, Latent Logic)

ABSTRACT

Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We also evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher return during training than existing methods.

1 INTRODUCTION

Reinforcement learning (RL) is typically concerned with finding an optimal policy that maximises expected return for a given Markov decision process (MDP) with an unknown reward and transition function. If these were known, the optimal policy could in theory be computed without interacting with the environment. By contrast, learning in an unknown environment typically requires trading off exploration (learning about the environment) and exploitation (taking promising actions). Balancing this tradeoff is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions actions not only on the environment state but on the agent's own uncertainty about the current MDP.
In principle, a Bayes-optimal policy can be computed using the framework of Bayes-adaptive Markov decision processes (BAMDPs) (Martin, 1967; Duff & Barto, 2002). The agent maintains a belief, i.e., a posterior distribution, over possible environments. Augmenting the state space of the underlying MDP with this posterior distribution yields a BAMDP, a special case of a belief MDP (Kaelbling et al., 1998).
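
As a rough illustration of the belief that a BAMDP agent maintains, the sketch below performs an exact Bayesian update over a small, finite set of candidate environments and then pairs the environment state with the updated belief to form the augmented state. Everything named here is a hypothetical choice made for illustration and does not come from the paper: the two tabular transition models, the `update_belief` and `hyper_state` helpers, and the decision to ignore rewards and update on transitions only. It is not the variBAD method, which learns approximate inference rather than enumerating environments.

```python
from typing import Dict, List, Tuple

# Hypothetical tabular model: (state, action) -> distribution over next states.
Transition = Dict[Tuple[int, int], List[float]]


def update_belief(
    belief: List[float],
    models: List[Transition],
    s: int,
    a: int,
    s_next: int,
) -> List[float]:
    """Bayes rule over candidate models: posterior(m) is proportional to
    prior(m) * p_m(s_next | s, a)."""
    unnormalised = [b * m[(s, a)][s_next] for b, m in zip(belief, models)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("transition has zero probability under every model")
    return [u / total for u in unnormalised]


def hyper_state(s: int, belief: List[float]) -> Tuple[int, Tuple[float, ...]]:
    """Augmented state: the environment state paired with the current belief."""
    return (s, tuple(belief))


# Two assumed candidate environments that differ only in their dynamics
# for the single (state=0, action=0) pair.
models: List[Transition] = [
    {(0, 0): [0.9, 0.1]},  # model A: usually stays in state 0
    {(0, 0): [0.2, 0.8]},  # model B: usually moves to state 1
]

belief = [0.5, 0.5]  # uniform prior over the two models
belief = update_belief(belief, models, s=0, a=0, s_next=1)
augmented = hyper_state(1, belief)
```

After observing the move to state 1, most of the probability mass shifts to the second model, and a policy defined on `augmented` can condition on that uncertainty as well as on the environment state. Enumerating environments like this only works for tiny problems, which is consistent with the intractability noted in the abstract.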