Bayesian optimization for backpropagation in Monte-Carlo tree search
The robust nature of MCTS, versus a traditional approach like depth-first search in alpha-beta pruning, has not only enabled a leapfrog in performance in computer Go, but has also led to its utilization in other games where it is difficult to evaluate states, as well as in other domains (Browne et al., 2012). However, MCTS is known to suffer from slow convergence in certain situations (Coquelin and Munos, 2007), in particular when the precise calculation of a narrow tactical sequence is critical for success. For example in boardgames, (Ramanujan et al., 2010) defines a level-k search trap for player p after a move m as a state of the game where the opponent of p has a guaranteed k -move winning strategy . More relevantly, they show through a series of experiments that MCTS performs poorly even in shallow traps, in contrast to regular minimax search; see also (Ramanujan et al., 2011; Ramanujan and Sel-man, 2011). T o better understand this phenomenon, we take a closer look at the update rule Q n Q n 1 R n 1 Q n 1 n (1) which is performed during the backpropagation phase of MCTS. Here, the current estimate of the value of a state is taken to be the simple average of all previous returns accrued upon visiting that state. Proceeding, we discuss various methods which seek to improve backpropagation by challenging the basic assumptions implied by (1): (i) Value estimation by averaging returns: Instead of updating a parent node's value with that of its MAX (MIN) child as in minimax search, backpropagation in MCTS averages all returns to obtain a good signal in noisy environments (this is 1 arXiv:2001.09325v1
Jan-25-2020
- Country:
- North America > United States
- Maryland (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Europe > Germany
- Baden-Württemberg > Freiburg (0.04)
- North America > United States
- Genre:
- Research Report (0.40)
- Industry:
- Leisure & Entertainment > Games > Go (0.49)
- Technology: