Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Choi, Heewoong, Jung, Sangwon, Ahn, Hongjoon, Moon, Taesup

arXiv.org Artificial Intelligence

In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods are limited in that they often overlook second-order preference, which indicates the relative strength of a preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be built efficiently using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE: it outperforms state-of-the-art baselines even with modest feedback budgets and is robust to both the amount of feedback and feedback noise. Our code is available at https://github.com/chwoong/LiRE
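One way a ranked list could be built from ternary feedback is sketched below. This is an illustrative sketch, not the LiRE implementation: the abstract does not say how the list is constructed, so the binary-search insertion, the tie-group representation, and the names `insert_into_ranked_list`, `build_ranked_list`, and `compare` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def insert_into_ranked_list(
    ranked: List[List[T]],
    item: T,
    compare: Callable[[T, T], int],
) -> int:
    """Insert `item` into `ranked`, a list of tie groups ordered worst to best.

    `compare(a, b)` is a hypothetical ternary feedback oracle:
    -1 if a is worse than b, 0 if equally preferred, +1 if a is better.
    Returns the number of feedback queries used for this insertion.
    """
    lo, hi = 0, len(ranked)
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # Query against one representative of the middle tie group.
        answer = compare(item, ranked[mid][0])
        queries += 1
        if answer == 0:
            ranked[mid].append(item)  # joins an existing tie group
            return queries
        if answer < 0:
            hi = mid
        else:
            lo = mid + 1
    ranked.insert(lo, [item])  # opens a new tie group at position lo
    return queries


def build_ranked_list(
    items: List[T], compare: Callable[[T, T], int]
) -> Tuple[List[List[T]], int]:
    """Build the full ranked list and count the total feedback used."""
    ranked: List[List[T]] = []
    total_queries = 0
    for item in items:
        total_queries += insert_into_ranked_list(ranked, item, compare)
    return ranked, total_queries
```

With scalar stand-ins for trajectories and `compare = lambda a, b: (a > b) - (a < b)`, calling `build_ranked_list([3, 1, 2, 1, 5], compare)` groups the two equal items together and orders the rest from worst to best. The position of a trajectory in such a list carries more than a single pairwise label, which is the kind of relative-strength signal the abstract calls second-order preference.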


A Generalized Acquisition Function for Preference-based Reward Learning

Ellis, Evan, Ghosal, Gaurav R., Russell, Stuart J., Dragan, Anca, Bıyık, Erdem

arXiv.org Artificial Intelligence

Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion focuses on precisely identifying all parameters of the reward function. This can potentially be wasteful as many parameters may result in the same reward, and many rewards may result in the same behavior in the downstream tasks. Instead, we show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar. We introduce a tractable framework that can capture such definitions of similarity. Our experiments in a synthetic environment, an assistive robotics environment with domain transfer, and a natural language processing problem with real datasets demonstrate the superior performance of our querying method over the state-of-the-art information gain method.
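The information gain criterion used as the baseline above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: it assumes a linear reward over trajectory features, a logistic (Bradley-Terry) choice model, and a belief represented by samples of the weight vector. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def choice_prob(w: Sequence[float], feat_a: Sequence[float], feat_b: Sequence[float]) -> float:
    """P(human prefers A over B) under an assumed logistic choice model."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-diff))


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) answer."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def information_gain(
    samples: List[Sequence[float]],
    feat_a: Sequence[float],
    feat_b: Sequence[float],
) -> float:
    """Mutual information between the answer and the reward parameters.

    Estimated from belief samples as H(mean of p) - mean of H(p).
    """
    probs = [choice_prob(w, feat_a, feat_b) for w in samples]
    mean_p = sum(probs) / len(probs)
    mean_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    return binary_entropy(mean_p) - mean_entropy
```

A query scores highly when the sampled parameters disagree confidently about the answer. Note that two samples such as `[1, 0]` and `[2, 0]` rank all behaviors identically, yet a query separating them still has positive information gain under this criterion, which illustrates the wastefulness the abstract describes.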


Active Inverse Learning in Stackelberg Trajectory Games

Yu, Yue, Levy, Jacob, Mehr, Negar, Fridovich-Keil, David, Topcu, Ufuk

arXiv.org Artificial Intelligence

Game-theoretic inverse learning is the problem of inferring the players' objectives from their actions. We formulate an inverse learning problem in a Stackelberg game between a leader and a follower, where each player's action is the trajectory of a dynamical system. We propose an active inverse learning method for the leader to infer which hypothesis among a finite set of candidates describes the follower's objective function. Instead of using passively observed trajectories like existing methods, the proposed method actively maximizes the differences in the follower's trajectories under different hypotheses to accelerate the leader's inference. We demonstrate the proposed method in a receding-horizon repeated trajectory game. Compared with uniformly random inputs, the leader inputs provided by the proposed method accelerate the convergence of the probability of different hypotheses conditioned on the follower's trajectory by orders of magnitude.
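The inference step described above, updating the probability of each hypothesis given the follower's observed trajectory, can be illustrated with a small Bayesian update. This is a sketch under assumptions that are not in the abstract: trajectories are flattened to lists of floats, observation noise is isotropic Gaussian with standard deviation `sigma`, and all prior probabilities are strictly positive. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def posterior_over_hypotheses(
    prior: Sequence[float],
    predicted: Sequence[Sequence[float]],
    observed: Sequence[float],
    sigma: float = 1.0,
) -> List[float]:
    """Posterior over a finite hypothesis set given one observed trajectory.

    `predicted[i]` is the trajectory the follower would produce under
    hypothesis i; the likelihood is an assumed Gaussian around it.
    """
    log_post = []
    for p, traj in zip(prior, predicted):
        sq_err = sum((o - x) ** 2 for o, x in zip(observed, traj))
        log_post.append(math.log(p) - sq_err / (2.0 * sigma ** 2))
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this model, the further apart the predicted trajectories are, the more sharply a single observation separates the hypotheses. That is the intuition behind choosing leader inputs that maximize the differences between the follower's trajectories.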


Active Reward Learning from Online Preferences

Myers, Vivek, Bıyık, Erdem, Sadigh, Dorsa

arXiv.org Artificial Intelligence

Robot policies need to adapt to human preferences and/or new environments. Human experts may have the domain knowledge required to help robots achieve this adaptation. However, existing works often require costly offline re-training on human feedback, and that feedback usually needs to be frequent and is too complex for humans to provide reliably. To avoid placing undue burden on human experts and to allow quick adaptation in critical real-world situations, we propose designing and sparingly presenting easy-to-answer pairwise action preference queries in an online fashion. Our approach designs queries and determines when to present them to maximize the expected value derived from the queries' information. We demonstrate our approach with experiments in simulation, human user studies, and real robot experiments. In these settings, our approach outperforms baseline techniques while presenting fewer queries to human experts. Experiment videos, code and appendices are found at https://sites.google.com/view/onlineactivepreferences.


Learning Multimodal Rewards from Rankings

Myers, Vivek, Bıyık, Erdem, Anari, Nima, Sadigh, Dorsa

arXiv.org Artificial Intelligence

Learning from human feedback has been shown to be a useful approach for acquiring robot reward functions. However, expert feedback is often assumed to be drawn from an underlying unimodal reward function. This assumption does not always hold, for example when multiple experts provide data or when a single expert provides data for different tasks. We thus go beyond learning a unimodal reward and focus on learning a multimodal reward function. We formulate multimodal reward learning as a mixture learning problem and develop a novel ranking-based learning approach, where the experts are only required to rank a given set of trajectories. Furthermore, as access to interaction data is often expensive in robotics, we develop an active querying approach to accelerate the learning process. We conduct experiments and user studies using a multi-task variant of OpenAI's LunarLander and a real Fetch robot, where we collect data from multiple users with different preferences. The results suggest that our approach can efficiently learn multimodal reward functions and improve data efficiency over benchmark methods that we adapt to our learning problem.
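The mixture formulation above can be illustrated by the likelihood of a single ranking under a mixture of reward functions. This is a sketch under assumptions that are not in the abstract: each mode is a linear reward over trajectory features, and rankings follow a Plackett-Luce model within each mode. The function names are hypothetical.

```python
import math
from typing import Sequence


def plackett_luce_prob(rewards_in_ranked_order: Sequence[float]) -> float:
    """Probability of a ranking (best first) under a Plackett-Luce model."""
    scores = [math.exp(r) for r in rewards_in_ranked_order]
    prob = 1.0
    for i in range(len(scores)):
        # The item at position i is chosen among those not yet ranked.
        prob *= scores[i] / sum(scores[i:])
    return prob


def mixture_ranking_prob(
    ranking: Sequence[int],
    features: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    mixing: Sequence[float],
) -> float:
    """Likelihood of `ranking` (item indices, best first) under the mixture.

    `weights[m]` is the reward weight vector of mode m and `mixing[m]`
    is its mixture coefficient; the coefficients should sum to 1.
    """
    total = 0.0
    for alpha, w in zip(mixing, weights):
        rewards = [
            sum(wi * fi for wi, fi in zip(w, features[idx])) for idx in ranking
        ]
        total += alpha * plackett_luce_prob(rewards)
    return total
```

With two items whose features are `[1, 0]` and `[0, 1]`, and two equally weighted modes with weights `[2, 0]` and `[0, 2]`, each of the two possible rankings has mixture probability 0.5, even though each individual mode strongly prefers one ordering. A single unimodal reward fitted to such data would have to treat the items as nearly equal, which is the failure case the abstract points to.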


APReL: A Library for Active Preference-based Reward Learning Algorithms

Bıyık, Erdem, Talati, Aditi, Sadigh, Dorsa

arXiv.org Artificial Intelligence

Reward learning is a fundamental problem in robotics: robots should operate in alignment with what their human users want. Many preference-based learning algorithms and active querying techniques have been proposed as solutions to this problem. In this paper, we present APReL, a library for active preference-based reward learning algorithms, which enables researchers and practitioners to experiment with existing techniques and easily develop their own algorithms for various modules of the problem.


The key to smarter robot collaborators may be more simplicity

#artificialintelligence

Think of all the subconscious processes you perform while you're driving. As you take in information about the surrounding vehicles, you're anticipating how they might move and thinking on the fly about how you'd respond to those maneuvers. You may even be thinking about how you might influence the other drivers based on what they think you might do. If robots are to integrate seamlessly into our world, they'll have to do the same. Now researchers from Stanford University and Virginia Tech have proposed a new technique to help robots perform this kind of behavioral modeling, which they will present at the annual international Conference on Robot Learning next week.


Artificial Intelligence Will Do What We Ask. That's a Problem. (Quanta Magazine)

#artificialintelligence

The danger of having artificially intelligent machines do our bidding is that we might not be careful enough about what we wish for. The lines of code that animate these machines will inevitably lack nuance, forget to spell out caveats, and end up giving AI systems goals and incentives that don't align with our true preferences. A now-classic thought experiment illustrating this problem was posed by the Oxford philosopher Nick Bostrom in 2003. Bostrom imagined a superintelligent robot, programmed with the seemingly innocuous goal of manufacturing paper clips. The robot eventually turns the whole world into a giant paper clip factory. Such a scenario can be dismissed as academic, a worry that might arise in some far-off future.


WATCH: Self-Driving Cars Need To Learn How Humans Drive

NPR Technology

One researcher is putting real humans into computerized driving simulations to help self-driving cars learn human behavior. In the not-too-distant future, Americans will be sharing the road with self-driving cars. Companies are pouring billions of dollars into developing self-driving vehicles. Waymo, formerly the Google self-driving-car project, says that its self-driving cars have already driven millions of miles on the open road. Stanford University assistant professor Dorsa Sadigh has ridden in self-driving cars.