Online Policy Optimization for Robust MDP
Dong, Jing, Li, Jingwei, Wang, Baoxiang, Zhang, Jingzhao
arXiv.org Artificial Intelligence
The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate complicated environments and solve complex tasks. Standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments differ only slightly in dynamics [Farebrother et al., 2018, Packer et al., 2018, Cobbe et al., 2019, Song et al., 2019, Raileanu and Fergus, 2021]. In practical applications, such mismatches in environment dynamics are common and can arise for a number of reasons, e.g., model deviation due to incomplete data, unexpected perturbations, and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP). Over a sequence of interactions, an MDP assumes that the dynamics remain unchanged and that the trained agent is tested on the same dynamics thereafter. To model the potential mismatch between system dynamics, the framework of robust MDPs was introduced to account for uncertainty in the parameters of the MDP [Satia and Lave Jr, 1973, White III and Eldeib, 1994, Nilim and El Ghaoui, 2005, Iyengar, 2005].
Sep-28-2022
- Country:
  - North America > United States
    - Washington > King County > Seattle (0.04)
  - Asia > China
    - Hong Kong (0.04)
    - Guangdong Province > Shenzhen (0.04)
- Genre:
  - Research Report (0.50)