A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.
Bai, Aijun (University of Science and Technology of China) | Wu, Feng (University of Southampton) | Zhang, Zongzhang (National University of Singapore) | Chen, Xiaoping (University of Science and Technology of China)
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we introduce a novel online planning algorithm for large POMDPs using Thompson sampling based MCTS that balances between cumulative and simple regrets. The proposed algorithm Dirichlet-Dirichlet-NormalGamma based Partially Observable Monte-Carlo Planning (D 2 NG-POMCP) treats the accumulated reward of performing an action from a belief state in the MCTS search tree as a random variable following an unknown distribution with hidden parameters. Bayesian method is used to model and infer the posterior distribution of these parameters by choosing the conjugate prior in the form of a combination of two Dirichlet and one NormalGamma distributions. Thompson sampling is exploited to guide the action selection in the search tree. Experimental results confirmed that our algorithm outperforms the state-of-the-art approaches on several common benchmark problems.
We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our approach is based on the discovery of a set of latent, shared dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern are both inferred from data. We develop efficient Markov chain Monte Carlo methods based on the Indian buffet process representation of the predictive distribution of the beta process, without relying on a truncated model. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth and death proposals. We examine the benefits of our proposed feature-based model on several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.
We propose a Bayesian nonparametric approach to relating multiple time series via a set of latent, dynamical behaviors. Using a beta process prior, we allow data-driven selection of the size of this set, as well as the pattern with which behaviors are shared among time series. Via the Indian buffet process representation of the beta process predictive distributions, we develop an exact Markov chain Monte Carlo inference method. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising unsupervised segmentation of visual motion capture data.
Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition functionwhich is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate allremaining intractable quantities using belief propagation and the linear response approximation.