Policy Selection
Online Adaptive Policy Selection in Time-Varying Systems: No-Regret via Contractive Perturbations
We study online adaptive policy selection in systems with time-varying costs and dynamics. We develop the Gradient-based Adaptive Policy Selection (GAPS) algorithm together with a general analytical framework for online policy selection via online optimization. Under our proposed notion of contractive policy classes, we show that GAPS approximates the behavior of an ideal online gradient descent algorithm on the policy parameters while requiring less information and computation. When convexity holds, our algorithm is the first to achieve optimal policy regret. When convexity does not hold, we provide the first local regret bound for online policy selection. Our numerical experiments show that GAPS can adapt to changing environments more quickly than existing benchmarks.
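The ideal baseline that GAPS is said to approximate is online gradient descent directly on the policy parameters: at each round, observe the gradient of that round's time-varying cost and take a descent step. Below is a minimal, hypothetical sketch of that baseline (not the GAPS algorithm itself, which achieves similar behavior with less information); the quadratic tracking costs and the learning rate are illustrative assumptions.

```python
import numpy as np

def online_policy_gradient(theta0, cost_grads, lr=0.1):
    """Ideal online gradient descent on policy parameters: at round t,
    query the gradient of the time-varying cost at the current
    parameters and take one descent step. GAPS approximates this
    trajectory using only causally available information."""
    theta = np.array(theta0, dtype=float)
    trajectory = [theta.copy()]
    for grad_t in cost_grads:           # one gradient oracle per round
        theta = theta - lr * grad_t(theta)   # OGD update
        trajectory.append(theta.copy())
    return trajectory

# Hypothetical time-varying environment: quadratic costs (theta - m_t)^2
# whose minimizer m_t drifts slowly, so the learner must keep adapting.
targets = [np.array([np.sin(0.1 * t)]) for t in range(200)]
grads = [(lambda th, m=m: 2.0 * (th - m)) for m in targets]
path = online_policy_gradient(np.zeros(1), grads, lr=0.2)
```

Because the target drifts by at most 0.1 per round while each step contracts the tracking error by a factor of 0.6, the iterates stay within a small band of the moving minimizer, which is the kind of tracking behavior the regret bounds formalize.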
Active Offline Policy Selection
This paper addresses the problem of policy selection in domains with abundant logged data but a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, a large gap remains between OPE estimates and full online evaluation in the real environment, while large amounts of online interaction are often infeasible in practice. To overcome this problem, we introduce active offline policy selection --- a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm-start the online evaluation. Then, in order to use the limited environment interactions wisely, we decide which policy to evaluate next based on a Bayesian optimization method with a kernel function that represents policy similarity. We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.
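The core loop described above — a Gaussian-process posterior over policy values, warm-started with OPE estimates as the prior mean and guided by a policy-similarity kernel — can be sketched as follows. Everything here is a hypothetical toy instance, not the paper's implementation: the policy embeddings, kernel, noise level, and UCB acquisition are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 candidate policies with unknown true values. Noisy OPE
# estimates serve as the GP prior mean; a squared-exponential kernel on
# stand-in policy embeddings encodes policy similarity.
true_values = np.array([0.2, 0.5, 0.9, 0.85, 0.3])
ope_mean = true_values + rng.normal(0, 0.2, 5)    # warm start from OPE
feats = rng.normal(size=(5, 3))                   # hypothetical embeddings
K = np.exp(-0.5 * np.sum((feats[:, None] - feats[None]) ** 2, axis=-1))

def posterior(X, y, noise=0.05):
    """GP posterior mean/variance over all policies given noisy returns
    y observed for the policies indexed by X."""
    if not X:
        return ope_mean.copy(), np.diag(K).copy()
    Kxx = K[np.ix_(X, X)] + noise**2 * np.eye(len(X))
    Kix = K[:, X]
    mu = ope_mean + Kix @ np.linalg.solve(Kxx, np.array(y) - ope_mean[X])
    var = np.diag(K) - np.einsum('ij,ji->i', Kix, np.linalg.solve(Kxx, Kix.T))
    return mu, var

def active_selection(budget=15, beta=2.0, noise=0.05):
    """Spend the interaction budget one rollout at a time on the policy
    whose upper confidence bound is highest, then return the posterior
    argmax as the selected policy."""
    X, y = [], []
    for _ in range(budget):
        mu, var = posterior(X, y, noise)
        i = int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 1e-12))))
        X.append(i)
        y.append(true_values[i] + rng.normal(0, noise))  # one online rollout
    mu, _ = posterior(X, y, noise)
    return int(np.argmax(mu))
```

The similarity kernel is what lets a rollout of one policy reduce uncertainty about related policies, so the budget concentrates on distinguishing the top candidates rather than evaluating every policy from scratch.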
Author Feedback
We would like to thank our reviewers for their thoughtful comments and feedback. To preserve anonymity, we cannot share a link to the repository. Our most challenging tasks are locomotion tasks, which are not well suited to human demonstrations, but we believe this is an important direction for research as well. We will add this rationale to the paper.