846c260d715e5b854ffad5f70a516c88-Reviews.html

Neural Information Processing Systems 

The paper proposes a Bayesian inference in Monte-Carlo tree search (MCTS) with Thompson sampling based action-selection strategy, called Dirichlet-NormalGamma MCTS (DNG-MTCS) algorithm. The method approximates the accumulated reward of following the current policy from a state, X_{s,\pi(s)}, by the normal distribution with the NromalGamma distribution prior. The state transition probabilities are estimated via Dirichlet distributions. Action-selection strategy is based on Thompson sampling approach, where the expected cumulative reward for each action is computed with the parametric distribution with parameters drawn from the posterior distributions and then the action with the highest expectation is selected. The authors apply the proposed method to several benchmark tasks and showed that the method can converge (slightly) faster than the UCT algorithm. Theoretical properties about convergence are also provided.