Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes
Wang, He; Shi, Laixi; Chi, Yuejie
arXiv.org Artificial Intelligence
In reinforcement learning (RL), agents aim to learn an optimal policy that maximizes the expected total rewards by actively interacting with an unknown environment. However, online data collection may be prohibitively expensive or potentially risky in many real-world applications, e.g., autonomous driving (Gu et al., 2022), healthcare (Yu et al., 2021), and wireless security (Uprety and Rawat, 2020). This motivates the study of offline RL, which leverages historical data (also known as batch data) to improve policy learning, and has attracted growing attention (Levine et al., 2020). Nonetheless, the performance of a policy learned with standard offline RL techniques can drop dramatically when the deployed environment shifts even slightly from the one experienced by the historical data, necessitating the development of robust RL algorithms that are resilient against environmental uncertainty. In response, recent years have witnessed a surge of interest in distributionally robust offline RL (Zhou et al., 2021b; Yang et al., 2022; Shi and Chi, 2022; Blanchet et al., 2024). In particular, given only historical data from a nominal environment, distributionally robust offline RL aims to learn a policy that optimizes the worst-case performance when the environment falls into some prescribed uncertainty set around the nominal one. Such a framework ensures that the performance of the learned policy does not fail drastically, provided that the distribution shift between the nominal and deployment environments is not excessively large. Nevertheless, most existing provable algorithms in distributionally robust offline RL focus only on the tabular setting with finite state and action spaces (Zhou et al., 2021b; Yang et al., 2022; Shi and Chi, 2022), where the sample complexity scales linearly with the size of the state-action space, which is prohibitive when the problem is high-dimensional.
Jun-26-2024
- Country:
- North America > United States
- Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East
- Jordan (0.04)
- Genre:
- Research Report (0.50)
- Industry:
- Health & Medicine (0.34)
- Information Technology (0.34)
- Transportation > Ground
- Road (0.34)