Online Policy Optimization for Robust MDP

Dong, Jing; Li, Jingwei; Wang, Baoxiang; Zhang, Jingzhao

arXiv.org Artificial Intelligence 

The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard RL methods, however, may fail catastrophically when deployed in a different environment, even if the two environments differ only slightly in their dynamics [Farebrother et al., 2018, Packer et al., 2018, Cobbe et al., 2019, Song et al., 2019, Raileanu and Fergus, 2021]. In practical applications, such mismatches in environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP). Over a sequence of interactions, an MDP assumes that the dynamics remain unchanged and that the trained agent is tested under the same dynamics thereafter. To model the potential mismatch between system dynamics, the robust MDP framework was introduced to account for uncertainty in the parameters of the MDP [Satia and Lave Jr, 1973, White III and Eldeib, 1994, Nilim and El Ghaoui, 2005, Iyengar, 2005].
