Online Planning with Lookahead Policies

Neural Information Processing Systems 

Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this we devise a multi-step greedy RTDP algorithm, which we call h -RTDP, that replaces the 1-step greedy policy with a h -step lookahead policy. We analyze h -RTDP in its exact form and establish that increasing the lookahead horizon, h, results in an improved sample complexity, with the cost of additional computations.