Tree Search-Based Policy Optimization under Stochastic Execution Delay
David Valensi, Esther Derman, Shie Mannor, Gal Dalal
arXiv.org Artificial Intelligence
The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays.

The conventional Markov decision process (MDP) framework commonly assumes that all of the information necessary for the next decision step is available in real time: the agent's current state is immediately observed, its chosen action instantly actuated, and the corresponding reward feedback concurrently perceived (Puterman, 2014). However, these input signals are often delayed in real-world applications such as robotics (Mahmood et al., 2018), healthcare (Politi et al., 2022), or autonomous systems, where they can manifest in different ways.
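To make the notion of delayed execution described above more concrete, here is a minimal Python sketch of a buffer that releases each decision only after its delay has elapsed. It is an illustration under stated assumptions, not the paper's formalism or the DEZ algorithm: the class name `DelayedExecutionQueue`, the uniform delay distribution, the rule that the most recently decided action among those that have arrived is the one executed, and the rule that the last executed action is held when nothing new arrives are all hypothetical choices made for this example.

```python
import random


class DelayedExecutionQueue:
    """Toy buffer: each decision becomes executable only after a delay.

    Assumptions (not taken from the paper): delays are uniform on
    {0, ..., max_delay} unless given explicitly; when several decisions are
    due, the most recently decided one wins; if nothing new is due, the last
    executed action is repeated.
    """

    def __init__(self, max_delay, default_action, seed=0):
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.pending = []                 # (execute_at, decided_at, action)
        self.t = 0                        # current time step
        self.current = default_action     # action being executed right now
        self.current_decided_at = -1      # decision time of self.current

    def step(self, action, delay=None):
        """Register a decision at time t; return the action executed at time t."""
        if delay is None:
            delay = self.rng.randint(0, self.max_delay)
        self.pending.append((self.t + delay, self.t, action))

        # Split pending decisions into those that are due and those still waiting.
        due = [p for p in self.pending if p[0] <= self.t]
        self.pending = [p for p in self.pending if p[0] > self.t]

        # Keep the most recently decided action; stale arrivals are dropped.
        for _, decided_at, a in due:
            if decided_at > self.current_decided_at:
                self.current, self.current_decided_at = a, decided_at

        self.t += 1
        return self.current


# Constant delay of 2 steps: the first two steps fall back to the default action.
q = DelayedExecutionQueue(max_delay=2, default_action=0)
executed = [q.step(a, delay=2) for a in [1, 2, 3, 4]]
print(executed)
```

With a constant delay the buffer simply shifts the decision sequence; with varying delays a later decision can arrive before an earlier one, in which case this sketch discards the stale decision when it finally arrives.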
Apr-8-2024
- Genre:
- Research Report (0.64)
- Industry:
- Health & Medicine (0.54)
- Leisure & Entertainment (0.48)
- Technology: