Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data
Sunil Madhow, Dan Xiao, Ming Yin, Yu-Xiang Wang
–arXiv.org Artificial Intelligence
Offline Reinforcement Learning (RL), which seeks to perform standard RL tasks using a pre-existing dataset of interactions with an MDP, is a key frontier in the effort to make RL methods more widely applicable. The ability to incorporate existing data into RL algorithms is crucial in many promising application domains. In safety-critical areas, such as autonomous driving (Kiran et al., 2020), the randomized exploration that characterizes online algorithms is not ethically tolerable. Even in lower-stakes applications, such as advertising (Cai et al., 2017), naively adopting online algorithms could mean throwing away vast reserves of previously collected data. The development of efficient offline algorithms promises to broaden RL's applicability by allowing practitioners to exercise some much-needed domain-specific control over the training process. Given a dataset D of interactions with an MDP M, two tasks that we may hope to achieve in offline RL are Offline Policy Evaluation (Yin & Wang, 2020) and Offline Learning (Lange et al., 2012). In Offline Policy Evaluation (OPE), we seek to estimate the value of a target policy π under M. In Offline Learning (OL), the goal is to use D to find a good policy π ∈ Π, where Π is some policy class. In this paper, we largely focus on OPE.
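To make the OPE task concrete, below is a minimal sketch of one classical estimator, per-trajectory importance sampling; this is illustrative only and not necessarily the estimator analyzed in this paper. The trajectory format, the policy interface (functions mapping a state-action pair to a probability), and all names (importance_sampling_ope, pi_target, pi_behavior) are assumptions made for this example.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Estimate the value of pi_target from data logged by pi_behavior.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_target, pi_behavior: functions (state, action) -> probability of taking
        `action` in `state` under the target and behavior policies, respectively.
    Returns the per-trajectory importance-sampling estimate of v^{pi_target}.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Reweight by the likelihood ratio of the observed action
            # under the target vs. behavior policy.
            weight *= pi_target(s, a) / pi_behavior(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy usage: single-step MDP with two actions, reward equal to the action taken.
rng = np.random.default_rng(0)
behavior = lambda s, a: 0.5                    # uniform logging policy
target = lambda s, a: 0.9 if a == 1 else 0.1   # policy we want to evaluate
data = [[(0, int(a), float(a))] for a in rng.integers(0, 2, size=10_000)]
print(importance_sampling_ope(data, target, behavior))  # approx. 0.9
```

Note that with adaptively collected data, the focus of this paper, the logging probabilities pi_behavior(s, a) vary across trajectories, so in practice they would be recorded at collection time rather than given by a single fixed function.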
Jun-24-2023