W(leaf,i) r+ γ V(s0) s env.RESET() solution [ ].List of actions N(leaf,i) 1 for 1 Lp do Q(leaf,i) W(leaf,i) actions PLANNER(s) function UPDATE(path, leaf)
–Neural Information Processing Systems
A.1 MCTS-kSubS algorithm In Algorithm 4 we present a general MCTS solver based on AlphaZero. Solver repeatedly queries the planner for a list of actions and executes them one by one. Baseline planner returns only a single action at a time, whereas MCTS-kSubS gives around kactions - to reach the desired subgoal (number of actions depends on a subgoal distance, which not always equals k in practice). MCTS-kSubS operates on a high-level subgoal graph: nodes are subgoals proposed by the generator (see Algorithm 3) and edges - lists of actions informing how to move from one subgoal to another (computed by the low-level conditional policy in Algorithm 2). The graph structure is represented by treevariable. For every subgoal, it keeps up to C3 best nearby subgoals (according to generator scores) along with a mentioned list of actions and sum of rewards to obtain while moving from the parent to the child subgoal. Most of MCTS implementation is shared between MCTS-kSubS and AlphaZero baseline, as we can treat the behavioral-cloning policy as a subgoal generator with k = 1. MCTS-kSubS and the baseline are encapsulated in GEN_CHILDREN function (Algorithms 5 and 6).
Neural Information Processing Systems
Apr-24-2026, 11:50:34 GMT