Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
In this paper, we study offline preference-based reinforcement learning (PbRL), where learning is based on pre-collected preference feedback over pairs of trajectories. While offline PbRL has demonstrated remarkable empirical success, existing theoretical approaches face challenges in ensuring conservatism under uncertainty, requiring computationally intractable confidence set constructions. We address this limitation by proposing Adversarial Preference-based Policy Optimization (APPO), a computationally efficient algorithm for offline PbRL that guarantees sample complexity bounds without relying on explicit confidence sets. By framing PbRL as a two-player game between a policy and a model, our approach enforces conservatism in a tractable manner. Under standard assumptions on function approximation and bounded trajectory concentrability, we derive a sample complexity bound. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability. Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, achieving performance comparable to existing state-of-the-art methods.

While Reinforcement Learning (RL) has achieved remarkable success in real-world applications (Mnih, 2013; Silver et al., 2017; Kalashnikov et al., 2018; Brohan et al., 2022), its performance depends heavily on the design of the reward function (Wirth et al., 2017), which can be challenging in practice. To address this issue, preference-based reinforcement learning (PbRL), also known as reinforcement learning with human feedback, has gained increasing attention as an alternative to manually designed rewards. In PbRL, a reward model is learned from preference feedback provided by human experts, who compare pairs of trajectories (Christiano et al., 2017). This approach allows the learning process to align more closely with human intentions. However, collecting preference feedback can be costly, especially when real-time feedback from human experts is required. In such cases, learning from pre-collected data is preferable to online learning. This setting is referred to as offline PbRL, where learning relies solely on pre-collected trajectories and preference feedback.
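To make the setup concrete, below is a minimal sketch of the standard reward-learning step used in PbRL: a reward model is fit to pairwise trajectory-segment preferences with a Bradley-Terry (cross-entropy) objective in the spirit of Christiano et al. (2017). This illustrates only the generic offline PbRL ingredient, not the APPO algorithm itself; the class and function names, network sizes, and tensor shapes are illustrative assumptions.

```python
# Hedged sketch: reward learning from pairwise trajectory preferences
# (Bradley-Terry / cross-entropy objective). NOT the paper's APPO method.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (..., obs_dim), act: (..., act_dim) -> (...,) per-step rewards
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward_model, seg0, seg1, prefs):
    """Bradley-Terry loss over a batch of trajectory-segment pairs.

    seg0, seg1: tuples (obs, act) of shape (batch, horizon, dim).
    prefs: (batch,) labels, 1.0 if segment 1 is preferred, 0.0 otherwise.
    """
    # Sum predicted per-step rewards over each segment (its predicted return).
    r0 = reward_model(*seg0).sum(dim=-1)  # (batch,)
    r1 = reward_model(*seg1).sum(dim=-1)  # (batch,)
    # Under the Bradley-Terry model, P(segment 1 preferred) = sigmoid(r1 - r0).
    return nn.functional.binary_cross_entropy_with_logits(r1 - r0, prefs)


if __name__ == "__main__":
    # Toy offline batch of pre-collected segment pairs with preference labels.
    B, H, obs_dim, act_dim = 32, 10, 4, 2
    model = RewardModel(obs_dim, act_dim)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    seg0 = (torch.randn(B, H, obs_dim), torch.randn(B, H, act_dim))
    seg1 = (torch.randn(B, H, obs_dim), torch.randn(B, H, act_dim))
    prefs = torch.randint(0, 2, (B,)).float()

    loss = preference_loss(model, seg0, seg1, prefs)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"preference loss: {loss.item():.4f}")
```

APPO departs from this naive pipeline by coupling policy optimization and model fitting adversarially, so that conservatism is enforced without constructing explicit confidence sets around the learned reward.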
arXiv.org Artificial Intelligence
Mar-7-2025