Qian, Rebecca
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
CH-Wang, Sky, Deshpande, Darshan, Muresan, Smaranda, Kannappan, Anand, Qian, Rebecca
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR comprises 573 real-world validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring 98% on average), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and hold out the rest as a private test set.
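As a rough illustration of how a system could be scored on the public split, here is a minimal sketch. It assumes a JSONL file with "question" and "answer" fields and an ask_assistant() hook; neither reflects the benchmark's actual release format.

```python
# Hypothetical scoring loop for a BLUR-style known-item search benchmark.
# Field names and the ask_assistant() hook are assumptions, not the released schema.
import json
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient string match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def score_public_split(path: str, ask_assistant) -> float:
    """Fraction of answerable questions whose returned item matches the gold answer."""
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            if "answer" not in item:  # questions with retained answers are skipped
                continue
            prediction = ask_assistant(item["question"])
            correct += int(normalize(prediction) == normalize(item["answer"]))
            total += 1
    return correct / max(total, 1)
```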
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking
Deshpande, Darshan, Ravi, Selvan Sunitha, CH-Wang, Sky, Mielczarek, Bartosz, Kannappan, Anand, Qian, Rebecca
The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs. While LLM judges have shown promise on constrained evaluation tasks, closed-source LLMs display critical shortcomings when deployed in real-world applications due to challenges with fine-grained metrics and explainability, while task-specific evaluation models lack cross-domain generalization. We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models, achieving comparable performance to LLMs 17x its size. GLIDER supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria. Extensive qualitative analysis shows that GLIDER scores are highly correlated with human judgments, with 91.3% human agreement. We have open-sourced GLIDER to facilitate future research.
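A minimal sketch of how an evaluator LLM of this kind might be queried for rubric-based scoring, assuming the Hugging Face transformers library is available; the model identifier and prompt layout below are illustrative assumptions, not the officially documented GLIDER input format (consult the released model card for that).

```python
# Illustrative rubric-based judging with a small evaluator LLM via transformers.
# Model id and prompt template are assumptions for illustration only.
from transformers import pipeline

judge = pipeline("text-generation", model="PatronusAI/glider")

prompt = """Evaluate the RESPONSE against the PASS CRITERIA using the RUBRIC.

PASS CRITERIA: The response must answer the question using only the given context.
RUBRIC: 1 = unsupported by the context, 3 = partially supported, 5 = fully supported.

CONTEXT: The 2023 annual report lists total revenue of $4.2B.
QUESTION: What was total revenue in 2023?
RESPONSE: Total revenue was $4.2B in 2023.

Give your reasoning, any highlighted spans, and a final score from 1 to 5."""

# The generated text would contain the reasoning, highlighted spans, and score.
print(judge(prompt, max_new_tokens=256)[0]["generated_text"])
```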
Lynx: An Open Source Hallucination Evaluation Model
Ravi, Selvan Sunitha, Mielczarek, Bartosz, Kannappan, Anand, Kiela, Douwe, Qian, Rebecca
Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported by or contradictory to the retrieved contexts. We introduce LYNX, a state-of-the-art hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains. Our experimental results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench, and our evaluation code for public access.
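A sketch of the kind of faithfulness check a hallucination judge performs, under assumed conventions: the prompt layout, the PASS/FAIL verdict, and the query_llm() hook are placeholders, not the released model's documented interface.

```python
# Hypothetical faithfulness check: does the ANSWER stay grounded in the DOCUMENT?
# Prompt format and verdict parsing are assumptions for illustration.
HALLUCINATION_PROMPT = """Given the QUESTION, DOCUMENT and ANSWER, decide whether the
ANSWER is faithful to the DOCUMENT. Reply with your reasoning, then a final verdict of
PASS (faithful) or FAIL (hallucinated) on the last line.

QUESTION: {question}
DOCUMENT: {document}
ANSWER: {answer}"""


def is_hallucinated(question: str, document: str, answer: str, query_llm) -> bool:
    """query_llm(prompt) -> str is any chat/completions client supplied by the caller."""
    verdict = query_llm(HALLUCINATION_PROMPT.format(
        question=question, document=document, answer=answer))
    last_line = verdict.strip().splitlines()[-1].upper()
    return "FAIL" in last_line  # crude parse of the final verdict line
```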
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Vidgen, Bertie, Scherrer, Nino, Kirk, Hannah Rose, Qian, Rebecca, Kannappan, Anand, Hale, Scott A., Röttger, Paul
The past year has seen rapid acceleration in the development of large language models (LLMs). However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. Trained annotators labelled every model response to SST (n = 3,000). We use these annotations to evaluate five AI safety filters (which assess whether a model's response is unsafe given a prompt) as a way of automatically evaluating models' performance on SST. The filters' performance varies considerably. There are also differences across the five harm areas, and between unsafe and safe responses. The widely used Perspective API has 72% accuracy, and a newly created zero-shot prompt to OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper contains prompts and responses that relate to child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.
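A minimal sketch of the filter-evaluation step described above: scoring a safety filter's safe/unsafe predictions against annotator labels, overall and per harm area. The record field names are assumptions about the data layout, not the released annotation format.

```python
# Score a safety filter against human annotations, overall and per harm area.
# Field names ("prompt", "response", "harm_area", "label") are assumed for illustration.
from collections import defaultdict


def filter_accuracy(records, predict_unsafe):
    """records: dicts with 'prompt', 'response', 'harm_area', 'label' in {'safe', 'unsafe'}.
    predict_unsafe(prompt, response) -> bool is the filter under test."""
    per_area = defaultdict(lambda: [0, 0])  # harm_area -> [num_correct, num_total]
    for r in records:
        pred = "unsafe" if predict_unsafe(r["prompt"], r["response"]) else "safe"
        per_area[r["harm_area"]][0] += int(pred == r["label"])
        per_area[r["harm_area"]][1] += 1
    total = sum(t for _, t in per_area.values())
    overall = sum(c for c, _ in per_area.values()) / max(total, 1)
    return overall, {area: c / t for area, (c, t) in per_area.items()}
```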
FinanceBench: A New Benchmark for Financial Question Answering
Islam, Pranab, Kannappan, Anand, Kiela, Douwe, Qian, Rebecca, Scherrer, Nino, Vidgen, Bertie
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, serving as a minimum performance standard. We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n = 2,400). The cases are available open source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using a longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.
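A sketch of the retrieval-system configuration referred to above, assuming pre-embedded document chunks; embed() and answer_with_llm() are placeholder hooks, not part of FinanceBench or the paper's released code.

```python
# Hypothetical open-book QA over financial filings: cosine-similarity retrieval over
# pre-embedded chunks, followed by an evidence-grounded prompt to an LLM.
import numpy as np


def top_k_chunks(question_vec, chunk_vecs, chunks, k=5):
    """Return the k document chunks whose embeddings are closest to the question."""
    sims = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9)
    return [chunks[i] for i in np.argsort(-sims)[:k]]


def answer_open_book(question, chunks, chunk_vecs, embed, answer_with_llm):
    """embed(text) -> vector and answer_with_llm(prompt) -> str are assumed hooks."""
    evidence = top_k_chunks(embed(question), chunk_vecs, chunks)
    prompt = ("Answer the question using only the evidence below.\n\n"
              + "\n\n".join(evidence)
              + f"\n\nQuestion: {question}")
    return answer_with_llm(prompt)
```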
Step by Step to Fairness: Attributing Societal Bias in Task-oriented Dialogue Systems
Su, Hsuan, Qian, Rebecca, Sankar, Chinnadhurai, Shayandeh, Shahin, Chen, Shang-Tse, Lee, Hung-yi, Bikel, Daniel M.
Recent works have shown considerable improvements in task-oriented dialogue (TOD) systems by utilizing pretrained large language models (LLMs) in an end-to-end manner. However, the biased behavior of each component in a TOD system and the error propagation issue in the end-to-end framework can lead to seriously biased TOD responses. Existing work on fairness only focuses on the total bias of a system. In this paper, we propose a diagnosis method to attribute bias to each component of a TOD system. With the proposed attribution method, we can gain a deeper understanding of the sources of bias. Additionally, researchers can mitigate biased model behavior at a more granular level. We conduct experiments to attribute the TOD system's bias toward three demographic axes: gender, age, and race. Experimental results show that the bias of a TOD system usually comes from the response generation model.
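One simple way to attribute bias component-wise is a swap-and-measure ablation, sketched below; this illustrates the general idea under assumed hooks (build_pipeline, bias_metric, reference components) and is not necessarily the paper's exact attribution procedure.

```python
# Component-wise bias attribution by swapping each module for a reference version
# and measuring the change in a bias metric. All hooks are assumed placeholders.
def attribute_bias(components, reference, build_pipeline, bias_metric, eval_set):
    """components: dict name -> module; reference: dict name -> debiased stand-in.
    build_pipeline(components) -> system; bias_metric(system, eval_set) -> float."""
    baseline = bias_metric(build_pipeline(components), eval_set)
    attribution = {}
    for name in components:
        swapped = dict(components, **{name: reference[name]})  # replace one module
        attribution[name] = baseline - bias_metric(build_pipeline(swapped), eval_set)
    return baseline, attribution  # larger attribution -> module contributes more bias
```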
Many Episode Learning in a Modular Embodied Agent via End-to-End Interaction
Sun, Yuxuan, Carlson, Ethan, Qian, Rebecca, Srinet, Kavya, Szlam, Arthur
In this work we give a case study of a modular embodied machine-learning (ML) powered agent that improves itself via interactions with crowd-workers. The agent consists of a set of modules, some of which are learned, and others heuristic. While the agent is not "end-to-end" in the ML sense, end-to-end interaction with humans and its environment is a vital part of the agent's learning mechanism. We describe how the design of the agent works together with the design of multiple annotation interfaces to allow crowd-workers to assign credit to module errors from these end-to-end interactions, and to label data for an individual module. We further show how this whole loop (including model re-training and re-deployment) can be automated. Over multiple loops with crowdsourced humans with no knowledge of the agent architecture, we demonstrate improvement of the agent's language understanding and visual perception modules. Present-day machine learning (ML) research prioritizes end-to-end learning. Not only are end-to-end models able to achieve excellent performance on static tasks, there is a growing literature on how to adapt pre-trained networks to new tasks, and large pre-trained models can have impressive zero-shot performance on unseen tasks. In the setting of embodied agents, this manifests as agents actualized as monolithic ML models, where inputs to the model are the agent's perceptual sensors, and the model's outputs directly control agent actions. There are now a number of environments designed for the training of end-to-end embodied agents (Beattie et al., 2016; Savva et al., 2019; Guss et al., 2019; Petrenko et al., 2021), and there is hope (and some evidence) that the same sort of transfer and adaptability seen in language and vision models will carry over to the embodied agent setting. Nevertheless, agents implemented as fully end-to-end ML models are rare in production systems (or in real-world embodied agents, a.k.a. robots). While this is in part a symptom of the rapid improvement and scaling in the literature and the lag in technology transfer, these systems require performance and safety guarantees that are still not easily obtainable from end-to-end ML models, and must be maintainable by human engineers.
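A schematic of the automated loop (interaction, credit assignment to modules, per-module retraining, redeployment) might look like the following; all functions are placeholders for the agent's actual infrastructure, not the paper's code.

```python
# Schematic many-episode improvement loop for a modular agent. Every callable here
# is an assumed placeholder standing in for real crowdsourcing and training tooling.
def improvement_loop(agent, num_rounds, collect_episodes, annotate_errors,
                     retrain, redeploy):
    for _ in range(num_rounds):
        episodes = collect_episodes(agent)        # end-to-end crowd-worker sessions
        labeled = annotate_errors(episodes)       # credit errors to specific modules
        for module_name, examples in labeled.items():
            # retrain only the modules that received new labeled data
            agent.modules[module_name] = retrain(agent.modules[module_name], examples)
        agent = redeploy(agent)                   # push the updated agent back out
    return agent
```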
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents
Smith, Eric Michael, Hsu, Orion, Qian, Rebecca, Roller, Stephen, Boureau, Y-Lan, Weston, Jason
At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing numbers of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use which method, and possible future directions.
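As one example of the statistical-sensitivity considerations discussed above, a pairwise preference comparison can be tested against chance with a binomial test; this is an illustrative check (with made-up counts), not the paper's exact analysis.

```python
# Test whether model A's win rate over model B in pairwise human judgments differs
# significantly from chance. Counts below are hypothetical.
from scipy.stats import binomtest


def pairwise_significance(wins_for_a: int, total_comparisons: int, alpha: float = 0.05):
    """Two-sided binomial test of the observed win rate against p = 0.5."""
    result = binomtest(wins_for_a, total_comparisons, p=0.5)
    return result.pvalue, result.pvalue < alpha


# e.g. 86 wins for A out of 150 pairwise judgments
print(pairwise_significance(86, 150))
```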
droidlet: modular, heterogenous, multi-modal agents
Pratik, Anurag, Chintala, Soumith, Srinet, Kavya, Gandhi, Dhiraj, Qian, Rebecca, Sun, Yuxuan, Drew, Ryan, Elkafrawy, Sara, Tiwari, Anoushka, Hart, Tucker, Williamson, Mary, Gupta, Abhinav, Szlam, Arthur
In recent years, there have been significant advances in building end-to-end Machine Learning (ML) systems that learn at scale. But most of these systems are: (a) isolated (perception, speech, or language only); and (b) trained on static datasets. On the other hand, in the field of robotics, large-scale learning has always been difficult: supervision is hard to gather and real-world physical interactions are expensive. In this work we introduce and open-source droidlet, a modular, heterogeneous agent architecture and platform. It allows us to exploit both large-scale static datasets in perception and language and sophisticated heuristics often used in robotics, and provides tools for interactive annotation. Furthermore, it brings together perception, language and action onto one platform, providing a path towards agents that learn from the richness of real-world interactions.
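A highly simplified sketch of what such a modular layout can look like; the module names and interfaces below are illustrative assumptions, not droidlet's actual API.

```python
# Minimal sketch of a modular embodied-agent loop: learned perception and language
# modules feed a shared memory, and a heuristic controller picks the next action.
class ModularAgent:
    def __init__(self, perception, parser, memory, controller):
        self.perception = perception    # learned vision model(s)
        self.parser = parser            # learned language -> task representation
        self.memory = memory            # shared state store across modules
        self.controller = controller    # heuristic task executor

    def step(self, observation, utterance=None):
        """Consume one observation (and optional utterance), return the next action."""
        self.memory.update(self.perception(observation))
        if utterance is not None:
            self.memory.add_task(self.parser(utterance))
        return self.controller(self.memory)
```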