Lu, Xiaoxin
AAAR-1.0: Assessing AI's Potential to Assist Research
Lou, Renze, Xu, Hanzi, Wang, Sijia, Du, Jiangshu, Kamoi, Ryo, Lu, Xiaoxin, Xie, Jian, Sun, Yuxuan, Zhang, Yusen, Ahn, Jihyun Janice, Fang, Hongchao, Zou, Zhuoyang, Ma, Wenchao, Li, Xi, Zhang, Kai, Xia, Congying, Huang, Lifu, Yin, Wenpeng
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) ReviewCritique, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as their limitations in conducting sophisticated research tasks. We will continue iterating on AAAR-1.0, releasing new versions over time.
Evaluating LLMs at Detecting Errors in LLM Responses
Kamoi, Ryo, Das, Sarkar Snigdha Sarathi, Lou, Renze, Ahn, Jihyun Janice, Zhao, Yilun, Lu, Xiaoxin, Zhang, Nan, Zhang, Yusen, Zhang, Ranran Haoran, Vummanthala, Sujeeth Reddy, Dave, Salika, Qin, Shaobo, Cohan, Arman, Yin, Wenpeng, Zhang, Rui
With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection in LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research has focused on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B, annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show that top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and that all LLM-based error detectors perform much worse than humans.
Fair Abstractive Summarization of Diverse Perspectives
Zhang, Yusen, Zhang, Nan, Liu, Yixin, Fabbri, Alexander, Liu, Junru, Kamoi, Ryo, Lu, Xiaoxin, Xiong, Caiming, Zhao, Jieyu, Radev, Dragomir, McKeown, Kathleen, Zhang, Rui
People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work on summarization metrics and Large Language Model (LLM) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting the perspectives of any group of people, and we propose four reference-free automatic metrics that measure the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.