Qian, Rebecca
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
CH-Wang, Sky, Deshpande, Darshan, Muresan, Smaranda, Kannappan, Anand, Qian, Rebecca
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR comprises 573 real-world validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring 98% on average), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and hold out the rest as a private test set.
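As a rough illustration of how a system could be scored on the public split, here is a minimal sketch. It assumes a JSONL file with "question" and "answer" fields and an ask_assistant() hook; neither reflects the benchmark's actual release format.

```python
# Hypothetical scoring loop for a BLUR-style known-item search benchmark.
# Field names and the ask_assistant() hook are assumptions, not the released schema.
import json
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient string match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def score_public_split(path: str, ask_assistant) -> float:
    """Fraction of answerable questions whose returned item matches the gold answer."""
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            if "answer" not in item:  # questions with retained answers are skipped
                continue
            prediction = ask_assistant(item["question"])
            correct += int(normalize(prediction) == normalize(item["answer"]))
            total += 1
    return correct / max(total, 1)
```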
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking
Deshpande, Darshan, Ravi, Selvan Sunitha, CH-Wang, Sky, Mielczarek, Bartosz, Kannappan, Anand, Qian, Rebecca
The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs. While LLM judges have shown promise on constrained evaluation tasks, closed-source LLMs display critical shortcomings when deployed in real-world applications due to challenges with fine-grained metrics and explainability, while task-specific evaluation models lack cross-domain generalization. We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models, achieving comparable performance to LLMs 17x its size. GLIDER supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria. Extensive qualitative analysis shows that GLIDER scores are highly correlated with human judgments, with 91.3% human agreement. We have open-sourced GLIDER to facilitate future research.
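A minimal sketch of how an evaluator LLM of this kind might be queried for rubric-based scoring, assuming the Hugging Face transformers library is available; the model identifier and prompt layout below are illustrative assumptions, not the officially documented GLIDER input format (consult the released model card for that).

```python
# Illustrative rubric-based judging with a small evaluator LLM via transformers.
# Model id and prompt template are assumptions for illustration only.
from transformers import pipeline

judge = pipeline("text-generation", model="PatronusAI/glider")

prompt = """Evaluate the RESPONSE against the PASS CRITERIA using the RUBRIC.

PASS CRITERIA: The response must answer the question using only the given context.
RUBRIC: 1 = unsupported by the context, 3 = partially supported, 5 = fully supported.

CONTEXT: The 2023 annual report lists total revenue of $4.2B.
QUESTION: What was total revenue in 2023?
RESPONSE: Total revenue was $4.2B in 2023.

Give your reasoning, any highlighted spans, and a final score from 1 to 5."""

# The generated text would contain the reasoning, highlighted spans, and score.
print(judge(prompt, max_new_tokens=256)[0]["generated_text"])
```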
Lynx: An Open Source Hallucination Evaluation Model
Ravi, Selvan Sunitha, Mielczarek, Bartosz, Kannappan, Anand, Kiela, Douwe, Qian, Rebecca
Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported by or contradictory to the retrieved contexts. We introduce LYNX, a state-of-the-art hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains. Our experimental results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench, and our evaluation code for public access.
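A sketch of the kind of faithfulness check a hallucination judge performs, under assumed conventions: the prompt layout, the PASS/FAIL verdict, and the query_llm() hook are placeholders, not the released model's documented interface.

```python
# Hypothetical faithfulness check: does the ANSWER stay grounded in the DOCUMENT?
# Prompt format and verdict parsing are assumptions for illustration.
HALLUCINATION_PROMPT = """Given the QUESTION, DOCUMENT and ANSWER, decide whether the
ANSWER is faithful to the DOCUMENT. Reply with your reasoning, then a final verdict of
PASS (faithful) or FAIL (hallucinated) on the last line.

QUESTION: {question}
DOCUMENT: {document}
ANSWER: {answer}"""


def is_hallucinated(question: str, document: str, answer: str, query_llm) -> bool:
    """query_llm(prompt) -> str is any chat/completions client supplied by the caller."""
    verdict = query_llm(HALLUCINATION_PROMPT.format(
        question=question, document=document, answer=answer))
    last_line = verdict.strip().splitlines()[-1].upper()
    return "FAIL" in last_line  # crude parse of the final verdict line
```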
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Vidgen, Bertie, Scherrer, Nino, Kirk, Hannah Rose, Qian, Rebecca, Kannappan, Anand, Hale, Scott A., Röttger, Paul
The past year has seen rapid acceleration in the development of large language models (LLMs). However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. Trained annotators labelled every model response to SST (n = 3,000). We use these annotations to evaluate five AI safety filters (which assess whether a model's response is unsafe given a prompt) as a way of automatically evaluating models' performance on SST. The filters' performance varies considerably. There are also differences across the five harm areas, and between unsafe and safe responses. The widely used Perspective API has 72% accuracy, and a newly created zero-shot prompt to OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper contains prompts and responses that relate to child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.
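A minimal sketch of the filter-evaluation step described above: scoring a safety filter's safe/unsafe predictions against annotator labels, overall and per harm area. The record field names are assumptions about the data layout, not the released annotation format.

```python
# Score a safety filter against human annotations, overall and per harm area.
# Field names ("prompt", "response", "harm_area", "label") are assumed for illustration.
from collections import defaultdict


def filter_accuracy(records, predict_unsafe):
    """records: dicts with 'prompt', 'response', 'harm_area', 'label' in {'safe', 'unsafe'}.
    predict_unsafe(prompt, response) -> bool is the filter under test."""
    per_area = defaultdict(lambda: [0, 0])  # harm_area -> [num_correct, num_total]
    for r in records:
        pred = "unsafe" if predict_unsafe(r["prompt"], r["response"]) else "safe"
        per_area[r["harm_area"]][0] += int(pred == r["label"])
        per_area[r["harm_area"]][1] += 1
    total = sum(t for _, t in per_area.values())
    overall = sum(c for c, _ in per_area.values()) / max(total, 1)
    return overall, {area: c / t for area, (c, t) in per_area.items()}
```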
FinanceBench: A New Benchmark for Financial Question Answering
Islam, Pranab, Kannappan, Anand, Kiela, Douwe, Qian, Rebecca, Scherrer, Nino, Vidgen, Bertie
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, serving as a minimum performance standard. We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n = 2,400). The cases are available open source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using a longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.
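A sketch of the retrieval-system configuration referred to above, assuming pre-embedded document chunks; embed() and answer_with_llm() are placeholder hooks, not part of FinanceBench or the paper's released code.

```python
# Hypothetical open-book QA over financial filings: cosine-similarity retrieval over
# pre-embedded chunks, followed by an evidence-grounded prompt to an LLM.
import numpy as np


def top_k_chunks(question_vec, chunk_vecs, chunks, k=5):
    """Return the k document chunks whose embeddings are closest to the question."""
    sims = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9)
    return [chunks[i] for i in np.argsort(-sims)[:k]]


def answer_open_book(question, chunks, chunk_vecs, embed, answer_with_llm):
    """embed(text) -> vector and answer_with_llm(prompt) -> str are assumed hooks."""
    evidence = top_k_chunks(embed(question), chunk_vecs, chunks)
    prompt = ("Answer the question using only the evidence below.\n\n"
              + "\n\n".join(evidence)
              + f"\n\nQuestion: {question}")
    return answer_with_llm(prompt)
```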
Step by Step to Fairness: Attributing Societal Bias in Task-oriented Dialogue Systems
Su, Hsuan, Qian, Rebecca, Sankar, Chinnadhurai, Shayandeh, Shahin, Chen, Shang-Tse, Lee, Hung-yi, Bikel, Daniel M.
Recent works have shown considerable improvements in task-oriented dialogue (TOD) systems by utilizing pretrained large language models (LLMs) in an end-to-end manner. However, the biased behavior of each component in a TOD system and the error propagation issue in the end-to-end framework can lead to seriously biased TOD responses. Existing work on fairness only focuses on the total bias of a system. In this paper, we propose a diagnosis method to attribute bias to each component of a TOD system. With the proposed attribution method, we can gain a deeper understanding of the sources of bias. Additionally, researchers can mitigate biased model behavior at a more granular level. We conduct experiments to attribute the TOD system's bias toward three demographic axes: gender, age, and race. Experimental results show that the bias of a TOD system usually comes from the response generation model.
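One simple way to attribute bias component-wise is a swap-and-measure ablation, sketched below; this illustrates the general idea under assumed hooks (build_pipeline, bias_metric, reference components) and is not necessarily the paper's exact attribution procedure.

```python
# Component-wise bias attribution by swapping each module for a reference version
# and measuring the change in a bias metric. All hooks are assumed placeholders.
def attribute_bias(components, reference, build_pipeline, bias_metric, eval_set):
    """components: dict name -> module; reference: dict name -> debiased stand-in.
    build_pipeline(components) -> system; bias_metric(system, eval_set) -> float."""
    baseline = bias_metric(build_pipeline(components), eval_set)
    attribution = {}
    for name in components:
        swapped = dict(components, **{name: reference[name]})  # replace one module
        attribution[name] = baseline - bias_metric(build_pipeline(swapped), eval_set)
    return baseline, attribution  # larger attribution -> module contributes more bias
```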
Many Episode Learning in a Modular Embodied Agent via End-to-End Interaction
Sun, Yuxuan, Carlson, Ethan, Qian, Rebecca, Srinet, Kavya, Szlam, Arthur
In this work we give a case study of a modular embodied machine-learning (ML) powered agent that improves itself via interactions with crowd-workers. The agent consists of a set of modules, some of which are learned, and others heuristic. While the agent is not "end-to-end" in the ML sense, end-to-end interaction with humans and its environment is a vital part of the agent's learning mechanism. We describe how the design of the agent works together with the design of multiple annotation interfaces to allow crowd-workers to assign credit to module errors from these end-to-end interactions, and to label data for an individual module. We further show how this whole loop (including model re-training and re-deployment) can be automated. Over multiple loops with crowdsourced humans with no knowledge of the agent architecture, we demonstrate improvement of the agent's language understanding and visual perception modules. Present-day machine learning (ML) research prioritizes end-to-end learning. Not only are end-to-end models able to achieve excellent performance on static tasks, there is a growing literature on how to adapt pre-trained networks to new tasks, and large pre-trained models can have impressive zero-shot performance on unseen tasks. In the setting of embodied agents, this manifests as agents actualized as monolithic ML models, where inputs to the model are the agent's perceptual sensors, and the model's outputs directly control agent actions. There are now a number of environments designed for the training of end-to-end embodied agents (Beattie et al., 2016; Savva et al., 2019; Guss et al., 2019; Petrenko et al., 2021), and there is hope (and some evidence) that the same sort of transfer and adaptability seen in language and vision models will carry over to the embodied agent setting. Nevertheless, agents implemented as fully end-to-end ML models are rare in production systems (or in real-world embodied agents, a.k.a. robots). While this is in part a symptom of the rapid improvement and scaling in the literature and the lag in technology transfer, these systems require performance and safety guarantees that are still not easily obtainable from end-to-end ML models, and must be maintainable by human engineers.
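A schematic of the automated loop (interaction, credit assignment to modules, per-module retraining, redeployment) might look like the following; all functions are placeholders for the agent's actual infrastructure, not the paper's code.

```python
# Schematic many-episode improvement loop for a modular agent. Every callable here
# is an assumed placeholder standing in for real crowdsourcing and training tooling.
def improvement_loop(agent, num_rounds, collect_episodes, annotate_errors,
                     retrain, redeploy):
    for _ in range(num_rounds):
        episodes = collect_episodes(agent)        # end-to-end crowd-worker sessions
        labeled = annotate_errors(episodes)       # credit errors to specific modules
        for module_name, examples in labeled.items():
            # retrain only the modules that received new labeled data
            agent.modules[module_name] = retrain(agent.modules[module_name], examples)
        agent = redeploy(agent)                   # push the updated agent back out
    return agent
```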
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents
Smith, Eric Michael, Hsu, Orion, Qian, Rebecca, Roller, Stephen, Boureau, Y-Lan, Weston, Jason
At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing numbers of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use which method, and possible future directions.
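As one example of the statistical-sensitivity considerations discussed above, a pairwise preference comparison can be tested against chance with a binomial test; this is an illustrative check (with made-up counts), not the paper's exact analysis.

```python
# Test whether model A's win rate over model B in pairwise human judgments differs
# significantly from chance. Counts below are hypothetical.
from scipy.stats import binomtest


def pairwise_significance(wins_for_a: int, total_comparisons: int, alpha: float = 0.05):
    """Two-sided binomial test of the observed win rate against p = 0.5."""
    result = binomtest(wins_for_a, total_comparisons, p=0.5)
    return result.pvalue, result.pvalue < alpha


# e.g. 86 wins for A out of 150 pairwise judgments
print(pairwise_significance(86, 150))
```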
droidlet: modular, heterogenous, multi-modal agents
Pratik, Anurag, Chintala, Soumith, Srinet, Kavya, Gandhi, Dhiraj, Qian, Rebecca, Sun, Yuxuan, Drew, Ryan, Elkafrawy, Sara, Tiwari, Anoushka, Hart, Tucker, Williamson, Mary, Gupta, Abhinav, Szlam, Arthur
In recent years, there have been significant advances in building end-to-end Machine Learning (ML) systems that learn at scale. But most of these systems are: (a) isolated (perception, speech, or language only); and (b) trained on static datasets. On the other hand, in the field of robotics, large-scale learning has always been difficult: supervision is hard to gather and real-world physical interactions are expensive. In this work we introduce and open-source droidlet, a modular, heterogeneous agent architecture and platform. It allows us to exploit both large-scale static datasets in perception and language and sophisticated heuristics often used in robotics, and provides tools for interactive annotation. Furthermore, it brings together perception, language and action onto one platform, providing a path towards agents that learn from the richness of real-world interactions.
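A highly simplified sketch of what such a modular layout can look like; the module names and interfaces below are illustrative assumptions, not droidlet's actual API.

```python
# Minimal sketch of a modular embodied-agent loop: learned perception and language
# modules feed a shared memory, and a heuristic controller picks the next action.
class ModularAgent:
    def __init__(self, perception, parser, memory, controller):
        self.perception = perception    # learned vision model(s)
        self.parser = parser            # learned language -> task representation
        self.memory = memory            # shared state store across modules
        self.controller = controller    # heuristic task executor

    def step(self, observation, utterance=None):
        """Consume one observation (and optional utterance), return the next action."""
        self.memory.update(self.perception(observation))
        if utterance is not None:
            self.memory.add_task(self.parser(utterance))
        return self.controller(self.memory)
```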