partial answer
AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
Jackson, Declan, Keating, William, Cameron, George, Hill-Smith, Micah
We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering
Ni, Tengjun, Yuan, Xin, Li, Shenghong, Wu, Kai, Liu, Ren Ping, Ni, Wei, Zhang, Wenjie
Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation
Han, Jinyi, Li, Tingyun, Chen, Shisong, Shi, Jie, Wang, Xinyi, Yue, Guanglei, Liang, Jiaqing, Lin, Xin, Wen, Liqian, Chen, Zulong, Xiao, Yanghua
While large language models (LLMs) have demonstrated remarkable performance across diverse tasks, they fundamentally lack self-awareness and frequently exhibit overconfidence, assigning high confidence scores to incorrect predictions. Accurate confidence estimation is therefore critical for enhancing the trustworthiness and reliability of LLM-generated outputs. However, existing approaches suffer from coarse-grained scoring mechanisms that fail to provide fine-grained, continuous confidence estimates throughout the generation process. To address these limitations, we introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Specifically, we first develop a comprehensive pipeline for constructing training data that effectively captures the underlying probabilistic distribution of LLM responses, and then train a model to predict confidence scores for arbitrary text sequences in a supervised manner. Furthermore, we propose a Backward Confidence Integration (BCI) strategy that leverages information from the subsequent text to enhance confidence estimation for the current sequence during inference. We also introduce three strategies for identifying optimal positions to perform confidence estimation within the generation process. Extensive experiments on multiple benchmark datasets demonstrate that FineCE consistently outperforms existing classical confidence estimation methods. Our code and all baselines used in the paper are available on GitHub.
ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA
Xinjie, Zhao, Gao, Fan, Yang, Rui, Chen, Yingjian, Wang, Yuyang, Zhu, Ying, Tang, Jiacheng, Li, Irene
Recent advances in large language models (LLMs) have significantly improved multi-hop question answering (QA) through direct Chain-of-Thought (CoT) reasoning. However, the irreversible nature of CoT leads to error accumulation, making it challenging to correct mistakes in multi-hop reasoning. This paper introduces ReAgent: a Reversible multi-Agent collaborative framework augmented with explicit backtracking mechanisms, enabling reversible multi-hop reasoning. By incorporating text-based retrieval, information aggregation and validation, our system can detect and correct errors mid-reasoning, leading to more robust and interpretable QA outcomes. The framework and experiments serve as a foundation for future work on error-tolerant QA systems. Empirical evaluations across three benchmarks indicate ReAgent's efficacy, yielding average about 6\% improvements against baseline models.
DebateQA: Evaluating Question Answering on Debatable Knowledge
Xu, Rongwu, Qi, Xuan, Qi, Zehan, Xu, Wei, Guo, Zhijiang
The rise of large language models (LLMs) has enabled us to seek answers to inherently debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability. However, traditional QA benchmarks assume fixed answers are inadequate for this purpose. To address this, we introduce DebateQA, a dataset of 2,941 debatable questions, each accompanied by multiple human-annotated partial answers that capture a variety of perspectives. We develop two metrics: Perspective Diversity, which evaluates the comprehensiveness of perspectives, and Dispute Awareness, which assesses if the LLM acknowledges the question's debatable nature. Experiments demonstrate that both metrics align with human preferences and are stable across different underlying models. Using DebateQA with two metrics, we assess 12 popular LLMs and retrieval-augmented generation methods. Our findings reveal that while LLMs generally excel at recognizing debatable issues, their ability to provide comprehensive answers encompassing diverse perspectives varies considerably.
DiscASP: A Graph-based ASP System for Finding Relevant Consistent Concepts with Applications to Conversational Socialbots
Li, Fang, Wang, Huaduo, Basu, Kinjal, Salazar, Elmer, Gupta, Gopal
We consider the problem of finding relevant consistent concepts in a conversational AI system, particularly, for realizing a conversational socialbot. Commonsense knowledge about various topics can be represented as an answer set program. However, to advance the conversation, we need to solve the problem of finding relevant consistent concepts, i.e., find consistent knowledge in the "neighborhood" of the current topic being discussed that can be used to advance the conversation. Traditional ASP solvers will generate the whole answer set which is stripped of all the associations between the various atoms (concepts) and thus cannot be used to find relevant consistent concepts. Similarly, goal-directed implementations of ASP will only find concepts directly relevant to a query. We present the DiscASP system that will find the partial consistent model that is relevant to a given topic in a manner similar to how a human will find it. DiscASP is based on a novel graph-based algorithm for finding stable models of an answer set program. We present the DiscASP algorithm, its implementation, and its application to developing a conversational socialbot.
Evaluating Conversational Characters Created through Question Generation
Chen, Grace (California State University Long Beach) | Tosch, Emma (Brandeis University) | Artstein, Ron (USC Institute for Creative Technologies) | Leuski, Anton ( USC Institute for Creative Technologies ) | Traum, David ( USC Institute for Creative Technologies )
Question generation tools can be used to extract a question-answer database from text articles. We investigate how suitable this technique is for giving domain-specific knowledge to conversational characters. We tested these characters by collecting questions and answers from naive participants, running the questions through the character, and comparing the system responses to the participant answers. Characters gave a full or partial answer to 53% of the user questions which had an answer available in the source text, and 43% of all questions asked. Performance was better for questions asked after the user had read the source text, and also varied by question type: the best results were answers to who questions, while answers to yes/no questions were among the poorer performers. The results show that question generation is a promising method for creating a question answering conversational character from an existing text.