$\pi2\text{vec}$: Policy Representations with Successor Features
Scarpellini, Gianluca, Konyushkova, Ksenia, Fantacci, Claudio, Paine, Tom Le, Chen, Yutian, Denil, Misha
arXiv.org Artificial Intelligence
Robot time is an important bottleneck in applying reinforcement learning in the real world. The lack of sufficient training data has driven progress in sim2real, offline reinforcement learning (offline RL), and data-efficient learning. However, these approaches do not address the data requirements of policy evaluation. Various proxy metrics have been introduced to replace evaluation on the real robotic system. For example, in sim2real we might measure performance in simulation (Lee et al., 2021), while in offline RL we can rely on Off-policy Policy Evaluation (OPE) methods (Precup, 2000; Li et al., 2011; Gulcehre et al., 2020; Fu et al., 2021). As we are usually interested in deploying a policy in the real world, recent works narrowed the problem by focusing on Offline Policy Selection (OPS), where the goal is to pick the best-performing policy from offline data. While these methods are useful for determining the coarse relative performance of policies, one still needs time on the real robot to obtain more reliable estimates.
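The successor features mentioned in the title are not defined in this excerpt; as a point of reference, in the standard formulation they are discounted sums of state-action features under a policy, as sketched below (the feature map $\phi$, weight vector $w$, and discount $\gamma$ are generic symbols, not notation taken from this abstract):

$$
\psi^{\pi}(s, a) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],
\qquad
Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top} w .
$$

Under this decomposition, a policy's expected return is linear in its successor features, which is what makes such representations a plausible proxy for comparing policies without additional robot time.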
Jun-16-2023