Lu, Xiaoxin
AAAR-1.0: Assessing AI's Potential to Assist Research
Lou, Renze, Xu, Hanzi, Wang, Sijia, Du, Jiangshu, Kamoi, Ryo, Lu, Xiaoxin, Xie, Jian, Sun, Yuxuan, Zhang, Yusen, Ahn, Jihyun Janice, Fang, Hongchao, Zou, Zhuoyang, Ma, Wenchao, Li, Xi, Zhang, Kai, Xia, Congying, Huang, Lifu, Yin, Wenpeng
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) ReviewCritique, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as their limitations in conducting sophisticated research tasks. We will continue iterating on AAAR-1.0, releasing new versions over time.
Evaluating LLMs at Detecting Errors in LLM Responses
Kamoi, Ryo, Das, Sarkar Snigdha Sarathi, Lou, Renze, Ahn, Jihyun Janice, Zhao, Yilun, Lu, Xiaoxin, Zhang, Nan, Zhang, Yusen, Zhang, Ranran Haoran, Vummanthala, Sujeeth Reddy, Dave, Salika, Qin, Shaobo, Cohan, Arman, Yin, Wenpeng, Zhang, Rui
With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection in LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research has focused on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B, annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show that top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and that all LLM-based error detectors perform much worse than humans.
Fair Abstractive Summarization of Diverse Perspectives
Zhang, Yusen, Zhang, Nan, Liu, Yixin, Fabbri, Alexander, Liu, Junru, Kamoi, Ryo, Lu, Xiaoxin, Xiong, Caiming, Zhao, Jieyu, Radev, Dragomir, McKeown, Kathleen, Zhang, Rui
People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work on summarization metrics and Large Language Model (LLM) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting the perspectives of any group of people, and we propose four reference-free automatic metrics that measure the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.