MCTS Based on Simple Regret
Tolpin, David (Ben-Gurion University of the Negev) | Shimony, Solomon Eyal (Ben-Gurion University of the Negev)
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the multi-armed bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS that outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, whose solution can use value of information (VOI) techniques. Although a theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, yielding an algorithm that outperforms both UCT and the other proposed algorithms in empirical evaluation.
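To make the cumulative-regret vs. simple-regret distinction concrete, the sketch below contrasts UCB1 (the cumulative-regret policy underlying UCT) with a 1/2-greedy rule as one possible simple-regret-oriented policy, and combines them in a two-stage selection in the spirit of SR+CR (simple-regret sampling at the root, cumulative-regret sampling below it). This is an illustrative sketch under our own assumptions, not the authors' implementation; the function names, the 1/2-greedy choice, and the exploration constant are placeholders.

```python
import math
import random

def ucb1_select(counts, means, c=math.sqrt(2)):
    """UCB1: cumulative-regret-oriented arm selection (as used inside UCT)."""
    total = sum(counts)
    # Pull every arm once before applying the UCB1 index.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(total) / counts[i]))

def sr_select(counts, means, eps=0.5):
    """1/2-greedy rule: one simple-regret-oriented policy (exploit the
    empirically best arm half of the time, otherwise sample uniformly)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    if random.random() < eps:
        return max(range(len(counts)), key=lambda i: means[i])
    return random.randrange(len(counts))

def two_stage_select(depth, counts, means):
    """SR+CR-style two-stage scheme: simple-regret sampling at the root
    (depth 0), UCB1 at deeper tree nodes."""
    return sr_select(counts, means) if depth == 0 else ucb1_select(counts, means)

# Example: at the root the simple-regret rule picks the arm to sample;
# deeper in the tree UCB1 is used, as in plain UCT.
arm = two_stage_select(depth=0, counts=[3, 5, 2], means=[0.4, 0.6, 0.3])
```

The point of the split is that samples at the root are never "rewarded" themselves; only the final move choice matters, so the root benefits from a policy tuned to simple regret, while interior nodes still estimate values well under a cumulative-regret policy.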
Jul-21-2012
- Country:
- Asia > Middle East
- Israel (0.14)
- North America
- Canada > Alberta (0.14)
- United States > California (0.14)