Kwon, Minae
Evaluating Human-Language Model Interaction
Lee, Mina, Srivastava, Megha, Hardy, Amelia, Thickstun, John, Durmus, Esin, Paranjape, Ashwin, Gerard-Ursin, Ines, Li, Xiang Lisa, Ladhak, Faisal, Rong, Frieda, Wang, Rose E., Kwon, Minae, Park, Joon Sung, Cao, Hancheng, Lee, Tony, Bommasani, Rishi, Bernstein, Michael, Liang, Percy
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
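To make the distinction concrete, here is a minimal Python sketch (not the released HALIE code; all field and metric names are illustrative) contrasting a non-interactive record, which keeps only the final output and a third-party score, with an interactive trace that also logs the process and first-person ratings such as enjoyment and ownership.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NonInteractiveRecord:
    """Standard evaluation: only the final output and a third-party score."""
    prompt: str
    final_output: str
    third_party_quality: float

@dataclass
class InteractionEvent:
    timestamp: float
    actor: str   # "user" or "system"
    action: str  # e.g. "query", "accept_suggestion", "edit"
    text: str

@dataclass
class InteractiveTrace:
    """Interactive evaluation: the process, the output, and first-person ratings."""
    task: str                                                # e.g. "crossword", "summarization"
    events: List[InteractionEvent] = field(default_factory=list)
    final_output: str = ""
    survey: Dict[str, float] = field(default_factory=dict)   # e.g. {"enjoyment": 4, "ownership": 3}

def process_metrics(trace: InteractiveTrace) -> Dict[str, float]:
    """Process-level metrics computed over the interaction, not just the output."""
    queries = sum(1 for e in trace.events if e.action == "query")
    edits = sum(1 for e in trace.events if e.action == "edit")
    return {"num_queries": float(queries), "num_edits": float(edits)}
```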
Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections
Zha, Lihan, Cui, Yuchen, Lin, Li-Heng, Kwon, Minae, Arenas, Montserrat Gonzalez, Zeng, Andy, Xia, Fei, Sadigh, Dorsa
Today's robot policies exhibit subpar performance when faced with the challenge of generalizing to novel environments. Human corrective feedback is a crucial form of guidance to enable such generalization. However, adapting to and learning from online human corrections is a non-trivial endeavor: not only do robots need to remember human feedback over time to retrieve the right information in new settings and reduce the intervention rate, but they also need to respond to feedback that ranges from arbitrary corrections about high-level human preferences to low-level adjustments to skill parameters. In this work, we present Distillation and Retrieval of Online Corrections (DROC), a large language model (LLM)-based system that can respond to arbitrary forms of language feedback, distill generalizable knowledge from corrections, and retrieve relevant past experiences based on textual and visual similarity to improve performance in novel settings. DROC is able to respond to a sequence of online language corrections that address failures in both high-level task plans and low-level skill primitives. We demonstrate that DROC effectively distills the relevant information from the sequence of online corrections into a knowledge base and retrieves that knowledge in settings with new task or object instances. DROC outperforms other techniques that directly generate robot code via LLMs, requiring only half as many total corrections in the first round and little to no corrections after two iterations. We show further results, videos, prompts, and code at https://sites.google.com/stanford.edu/droc.
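A minimal sketch of the distill-and-retrieve pattern described above (this is not the released DROC code; it only covers the textual side of retrieval, and `embed_text` is a hypothetical stand-in for whatever text or visual encoder a real system would use):

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder encoder: a real system would call a text or image embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

class CorrectionKnowledgeBase:
    """Stores knowledge distilled from online language corrections."""
    def __init__(self):
        self.entries = []  # each entry: {"knowledge": str, "key": np.ndarray}

    def distill(self, correction: str, task_context: str) -> None:
        # In DROC an LLM decides which part of a correction is generalizable
        # (e.g. a lasting preference vs. a one-off pose adjustment); here we
        # simply store the raw correction keyed by its context embedding.
        knowledge = f"In context '{task_context}': {correction}"
        self.entries.append({"knowledge": knowledge,
                             "key": embed_text(task_context + " " + correction)})

    def retrieve(self, new_context: str, k: int = 3) -> list:
        """Return the k pieces of past knowledge most similar to the new setting."""
        query = embed_text(new_context)
        scored = sorted(self.entries,
                        key=lambda e: float(query @ e["key"]), reverse=True)
        return [e["knowledge"] for e in scored[:k]]
```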
Toward Grounded Social Reasoning
Kwon, Minae, Hu, Hengyuan, Myers, Vivek, Karamcheti, Siddharth, Dragan, Anca, Sadigh, Dorsa
Consider a robot tasked with tidying a desk with a meticulously constructed Lego sports car. A human may recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting that there is an occluded car, the robot may need to actively perceive the car to know whether it is an advanced model car made out of Legos or a toy car built by a toddler. We propose an approach that leverages an LLM and vision language model (VLM) to help a robot actively perceive its environment to perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on 2 carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement on the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
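As a rough illustration of the active-perception loop sketched above (a hedged sketch, not the released code; `call_llm`, `call_vlm`, and `take_closeup_image` are hypothetical stubs standing in for an LLM API, a VLM API, and the robot's camera):

```python
def call_llm(prompt: str) -> str:
    """Stub for an LLM API call; replace with a real model query."""
    return "Is the object a carefully built model or a scattered toy?"

def call_vlm(image, question: str) -> str:
    """Stub for a VLM API call over an image; replace with a real model query."""
    return "It looks like a carefully built Lego model."

def take_closeup_image(object_name: str):
    """Stub for the robot's camera; returns a placeholder image handle."""
    return f"<closeup of {object_name}>"

def grounded_cleanup_action(scene_description: str, object_name: str) -> str:
    # 1. The LLM proposes a question whose answer it needs before acting.
    question = call_llm(
        f"Scene: {scene_description}\n"
        f"Before tidying the '{object_name}', what single question about it "
        f"would help decide the socially appropriate action?")
    # 2. Active perception: take a close-up image and ask the VLM the question.
    image = take_closeup_image(object_name)
    answer = call_vlm(image, question)
    # 3. The LLM grounds its decision in the observed answer.
    return call_llm(
        f"Scene: {scene_description}\nObject: {object_name}\n"
        f"Q: {question}\nA: {answer}\n"
        f"Choose one action: leave as is, relocate, or put away.")
```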
Reward Design with Language Models
Kwon, Minae, Xie, Sang Michael, Bullard, Kalesha, Sadigh, Dorsa
Reward design in reinforcement learning (RL) is challenging since human notions of desired behavior can be difficult to specify via reward functions or may require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
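A minimal sketch of the LLM-as-proxy-reward idea described above (the prompt wording and the `query_llm` stub are illustrative, not the paper's exact prompts or code):

```python
def query_llm(prompt: str) -> str:
    """Stub for a call to an LLM such as GPT-3; replace with a real API call."""
    return "Yes"

# The user writes this once, before training (a zero-shot description here;
# few-shot examples could be appended instead).
USER_PROMPT = ("The user wants an agent that negotiates fairly: it should accept "
               "roughly even splits of the resources and reject very unequal ones.\n")

def llm_proxy_reward(episode_summary: str) -> float:
    """Return 1.0 if the LLM judges the episode aligned with the user's prompt, else 0.0."""
    prompt = (USER_PROMPT
              + f"Episode: {episode_summary}\n"
              + "Did the agent behave as the user intended? Answer Yes or No: ")
    answer = query_llm(prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0

# During RL training, this scalar would replace a hand-designed reward, e.g.:
# reward = llm_proxy_reward(describe(episode)); agent.update(episode, reward)
```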
Targeted Data Acquisition for Evolving Negotiation Agents
Kwon, Minae, Karamcheti, Siddharth, Cuellar, Mariano-Florentino, Sadigh, Dorsa
Successful negotiators must learn how to balance optimizing for self-interest and cooperation. Yet current artificial negotiation agents often heavily depend on the quality of the static datasets they were trained on, limiting their capacity to fashion an adaptive response balancing self-interest and cooperation. For this reason, we find that these agents can achieve either high utility or cooperation, but not both. To address this, we introduce a targeted data acquisition framework where we guide the exploration of a reinforcement learning agent using annotations from an expert oracle.
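A hedged sketch of the targeted data acquisition idea (not the paper's implementation: the novelty-based selection criterion, the `expert_oracle` interface, and the representation of episodes as lists of utterance strings are all assumptions made for illustration):

```python
def novelty(episode, seen_utterances) -> float:
    """Fraction of the episode's utterances not seen in earlier training data."""
    return sum(1 for u in episode if u not in seen_utterances) / max(len(episode), 1)

def targeted_acquisition(candidate_episodes, expert_oracle, budget, seen_utterances):
    """Ask the expert oracle to annotate only the `budget` most novel episodes."""
    ranked = sorted(candidate_episodes,
                    key=lambda ep: novelty(ep, seen_utterances), reverse=True)
    return [(ep, expert_oracle(ep)) for ep in ranked[:budget]]

# Usage sketch:
# annotations = targeted_acquisition(rollouts, expert, budget=10, seen_utterances=seen)
# The annotated episodes are then added to the data used to update the negotiation agent.
```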