
Preference-based Reinforcement Learning with Finite-Time Guarantees

Neural Information Processing Systems

Preference-based Reinforcement Learning (PbRL) replaces the reward values of traditional reinforcement learning with preferences, which better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that a unique optimal policy may not exist if preferences over trajectories are deterministic. If preferences are stochastic and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that identify the best policy up to accuracy $\varepsilon$ with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL using a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.
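The link between stochastic preferences and hidden reward values that the abstract refers to is commonly modeled in Bradley-Terry style; the abstract does not specify the link function, so the logistic form below is an assumption for illustration:

```python
import math

def preference_prob(reward_a, reward_b):
    """Bradley-Terry-style preference probability: the chance that
    trajectory A is preferred over trajectory B, given their hidden
    cumulative rewards. Equal rewards give probability 0.5."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
```

Under this model, deterministic preferences correspond to the limit where the reward gap is scaled to infinity, which is the regime in which the paper shows a unique optimal policy may fail to exist.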


Learning Real-World Acrobatic Flight from Human Preferences

Merk, Colin, Geles, Ismail, Xing, Jiaxu, Romero, Angel, Ramponi, Giorgia, Scaramuzza, Davide

arXiv.org Artificial Intelligence

Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. In this work, we explore the use of PbRL for agile drone control, focusing on the execution of dynamic maneuvers such as powerloops. Building on Preference-based Proximal Policy Optimization (Preference PPO), we propose Reward Ensemble under Confidence (REC), an extension to the reward learning objective that improves preference modeling and learning stability. Our method achieves 88.4% of the shaped reward performance, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them to real-world drones, demonstrating multiple acrobatic maneuvers where human preferences emphasize stylistic qualities of motion. Furthermore, we demonstrate the applicability of our probabilistic reward model in a representative MuJoCo environment for continuous control. Finally, we highlight the limitations of manually designed rewards, observing only 60.7% agreement with human preferences. These results underscore the effectiveness of PbRL in capturing complex, human-centered objectives across both physical and simulated domains.
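The Reward Ensemble under Confidence (REC) idea described above can be sketched as follows; the abstract does not give REC's exact objective, so every detail here (the agreement-based weight, the clipping, the function name) is an illustrative assumption, not the paper's formulation:

```python
import numpy as np

def confidence_weighted_loss(member_returns_a, member_returns_b, label):
    """Illustrative sketch: an ensemble of reward models each predicts
    returns for segments A and B; preference pairs on which the members
    disagree are down-weighted in the cross-entropy preference loss."""
    # Per-member probability that A is preferred over B (logistic link).
    probs = 1.0 / (1.0 + np.exp(-(member_returns_a - member_returns_b)))
    # Agreement-based confidence: 1 when all members predict the same probability.
    confidence = max(1.0 - 2.0 * float(probs.std()), 0.0)
    p = float(np.clip(probs.mean(), 1e-6, 1.0 - 1e-6))
    cross_entropy = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return confidence * cross_entropy
```

Weighting by ensemble agreement is one plausible way to improve learning stability: confidently mislabeled or ambiguous pairs contribute less gradient than pairs the ensemble agrees on.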





CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries

Mu, Ni, Hu, Hao, Hu, Xiao, Yang, Yiqin, Xu, Bo, Jia, Qing-Shan

arXiv.org Artificial Intelligence

Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL's real-world applicability. To address this, we propose an offline PbRL method, Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information so that clearly distinguished segments are spaced apart, facilitating the selection of less ambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines under both non-ideal teachers and real human feedback. Our approach not only selects more distinguishable queries but also learns meaningful trajectory embeddings.
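The embedding objective described above resembles a margin-based contrastive loss; the sketch below is an assumption about its general shape (the margin, distance metric, and treatment of ambiguous pairs are all invented for illustration), not CLARIFY's actual loss:

```python
import numpy as np

def contrastive_margin_loss(emb_a, emb_b, clearly_distinguished, margin=1.0):
    """Illustrative sketch: segment pairs with a clear preference ordering
    are pushed at least `margin` apart in the embedding space, so future
    queries can be drawn from well-separated (unambiguous) pairs."""
    dist = float(np.linalg.norm(emb_a - emb_b))
    if clearly_distinguished:
        return max(0.0, margin - dist)  # penalize clear pairs that sit too close
    return dist                          # ambiguous pairs may remain nearby
```

At query-selection time, one would then prefer candidate pairs whose embeddings are far apart, since those are the pairs annotators can label reliably.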


Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning

Kang, Hyungkyu, Oh, Min-hwan

arXiv.org Artificial Intelligence

In this paper, we study offline preference-based reinforcement learning (PbRL), where learning is based on pre-collected preference feedback over pairs of trajectories. While offline PbRL has demonstrated remarkable empirical success, existing theoretical approaches face challenges in ensuring conservatism under uncertainty, requiring computationally intractable confidence set constructions. We address this limitation by proposing Adversarial Preference-based Policy Optimization (APPO), a computationally efficient algorithm for offline PbRL that guarantees sample complexity bounds without relying on explicit confidence sets. By framing PbRL as a two-player game between a policy and a model, our approach enforces conservatism in a tractable manner. Using standard assumptions on function approximation and bounded trajectory concentrability, we derive a sample complexity bound. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability. Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, showing performance comparable to existing state-of-the-art methods.

While Reinforcement Learning (RL) has achieved remarkable success in real-world applications (Mnih, 2013; Silver et al., 2017; Kalashnikov et al., 2018; Brohan et al., 2022), its performance depends heavily on the design of the reward function (Wirth et al., 2017), which can be challenging in practice. To address this issue, preference-based reinforcement learning (PbRL), also known as reinforcement learning with human feedback, has gained increasing attention as an alternative to manually designed rewards. In PbRL, a reward model is learned from preference feedback provided by human experts, who compare pairs of trajectories (Christiano et al., 2017). This approach enables the learning process to align better with human intentions. However, collecting preference feedback can be costly, especially when real-time feedback from human experts is required. In such cases, learning from pre-collected data is preferred over online learning. This approach is referred to as offline PbRL, where learning relies solely on pre-collected trajectories and preference feedback.
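The two-player game structure underlying APPO can be illustrated with a toy matrix game: the model player picks the reward model that minimizes the policy's value (enforcing conservatism), and the policy player maximizes that worst-case value. All numbers and the finite candidate sets below are invented for illustration and do not come from the paper:

```python
import numpy as np

# Toy setup: two candidate reward models over three trajectories; a policy
# is a probability distribution over those trajectories.
reward_models = np.array([[1.0, 0.2, 0.5],
                          [0.3, 0.9, 0.4]])

def adversarial_value(policy):
    """Model player: choose the reward model minimizing the policy's value."""
    values = reward_models @ policy
    return float(values.min())

def best_conservative_policy(candidates):
    """Policy player: maximize the adversarially chosen (worst-case) value."""
    return max(candidates, key=adversarial_value)

candidates = [np.array([1.0, 0.0, 0.0]),
              np.array([0.0, 1.0, 0.0]),
              np.array([1/3, 1/3, 1/3])]
robust = best_conservative_policy(candidates)
```

In this toy instance the mixed (uniform) policy wins the max-min comparison: each deterministic candidate is badly penalized by one of the two reward models, while the mixture hedges against both, which mirrors how adversarial conservatism avoids overcommitting to an uncertain reward estimate.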