evidence sentence
Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
Hwang, Seonjeong, Kim, Hyounghun, Lee, Gary Geunbae
Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the items are administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
- Asia > China (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Floriana (0.04)
- Education > Educational Setting (0.94)
- Education > Assessment & Standards > Student Performance (0.71)
Improving the fact-checking performance of language models by relying on their entailment ability
Kumar, Gaurav, Mazumder, Debajyoti, Garg, Ayush, Patro, Jasabanta
Automated fact-checking has been a challenging task for the research community. Past works tried various strategies, such as end-to-end training, retrieval-augmented generation, and prompt engineering, to build robust fact-checking systems. However, their accuracy has not been high enough for real-world deployment. In contrast, we propose a simple yet effective strategy, where entailed justifications generated by LLMs are used to train encoder-only language models (ELMs) for fact-checking. We conducted a rigorous set of experiments, comparing our approach with recent works and various prompting and fine-tuning strategies to demonstrate the superiority of our approach. Additionally, we performed a quality analysis of model explanations, ablation studies, and an error analysis to provide a comprehensive understanding of our approach.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > India > Madhya Pradesh > Bhopal (0.04)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- (17 more...)
- Workflow (0.92)
- Research Report > New Finding (0.67)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Health & Medicine (1.00)
- (6 more...)
How Grounded is Wikipedia? A Study on Structured Evidential Support and Retrieval
Walden, William, Ricci, Kathryn, Wanner, Miriam, Jiang, Zhengping, May, Chandler, Zhou, Rongkun, Van Durme, Benjamin
Wikipedia is a critical resource for modern NLP, serving as a rich repository of up-to-date and citation-backed information on a wide variety of subjects. The reliability of Wikipedia -- its groundedness in its cited sources -- is vital to this purpose. This work analyzes both how grounded Wikipedia is and how readily fine-grained grounding evidence can be retrieved. To this end, we introduce PeopleProfiles -- a large-scale, multi-level dataset of claim support annotations on biographical Wikipedia articles. We show that: (1) ~22% of claims in Wikipedia lead sections are unsupported by the article body; (2) ~30% of claims in the article body are unsupported by their publicly accessible sources; and (3) real-world Wikipedia citation practices often differ from documented standards. Finally, we show that complex evidence retrieval remains a challenge -- even for recent reasoning rerankers.
- Asia > Singapore (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (11 more...)
- Health & Medicine (0.46)
- Media > Film (0.46)
- Leisure & Entertainment (0.46)
Utilizing LLMs to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK
Curran, Damian, Chapman, Brian, Conway, Mike
Australia and the UK have developed contrasting approaches to the regulation of electronic cigarettes, with - broadly speaking - Australia adopting a relatively restrictive approach and the UK adopting a more permissive approach. Notably, these divergent policies were developed from the same broad evidence base. In this paper, to investigate differences in how the two jurisdictions manage and present evidence, we developed and evaluated a Large Language Model-based sentence classifier to perform automated analyses of electronic cigarette-related policy documents drawn from official Australian and UK legislative processes (109 documents in total). Specifically, we utilized GPT-4 to automatically classify sentences based on whether they contained claims that e-cigarettes were broadly helpful or harmful for public health. Our LLM-based classifier achieved an F-score of 0.9. Further, when applying the classifier to our entire sentence-level corpus, we found that Australian legislative documents show a much higher proportion of harmful statements, and a lower proportion of helpful statements compared to the expected values, with the opposite holding for the UK. In conclusion, this work utilized an LLM-based approach to provide evidence to support the contention that - drawing on the same evidence base - Australian ENDS-related policy documents emphasize the harms associated with ENDS products and UK policy documents emphasize the benefits. Further, our approach provides a starting point for using LLM-based methods to investigate the complex relationship between evidence and health policy formation.
- Oceania > New Zealand (0.05)
- Oceania > Australia > Victoria > Melbourne (0.05)
- North America > United States > Virginia (0.05)
- (4 more...)
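The comparison against "expected values" mentioned in the abstract above can be illustrated with a small sketch. It assumes the expected values are the counts implied by independence between jurisdiction and claim type in a contingency table; the abstract does not spell out the exact test, and the counts below are hypothetical, not taken from the paper.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# HYPOTHETICAL sentence counts (illustration only, not from the paper).
# Rows: Australia, UK. Columns: harmful, helpful.
observed = [
    [60, 20],
    [40, 80],
]
expected = expected_counts(observed)

for name, obs_row, exp_row in zip(["Australia", "UK"], observed, expected):
    for label, obs, exp in zip(["harmful", "helpful"], obs_row, exp_row):
        direction = "above" if obs > exp else "below"
        print(f"{name} {label}: observed {obs}, expected {exp:.1f} ({direction} expected)")
```

With these made-up counts, the Australian row sits above its expected count for harmful statements and below it for helpful ones, with the reverse for the UK row, which is the pattern the abstract describes.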
CDER: Collaborative Evidence Retrieval for Document-level Relation Extraction
Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, which are crucial for precisely identifying the relationship between an entity pair, help focus on essential text segments and thereby improve DocRE performance. However, existing evidence retrieval systems often overlook the collaborative nature among semantically similar entity pairs in the same document, hindering the effectiveness of the evidence retrieval task. To address this, we propose a novel evidence retrieval framework, namely CDER. CDER employs an attentional graph-based architecture to capture collaborative patterns and incorporates a dynamic sub-structure for additional robustness in evidence retrieval. Experimental results on the benchmark DocRE dataset show that CDER not only excels in the evidence retrieval task but also enhances the overall performance of existing DocRE systems.
- Europe > Germany (0.05)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
Fayyaz, Mohsen, Modarressi, Ali, Schuetze, Hinrich, Peng, Nanyun
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
- North America > United States > New York (0.28)
- North America > Canada (0.28)
- Europe > Italy (0.28)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.88)
Say Less, Mean More: Leveraging Pragmatics in Retrieval-Augmented Generation
Riaz, Haris, Riloff, Ellen, Surdeanu, Mihai
We propose a simple, unsupervised method that injects pragmatic principles into retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval to enhance the utility of retrieved contexts. Our approach first identifies which sentences in a pool of documents retrieved by RAG are most relevant to the question at hand and cover all the topics addressed in the input question and no more. It then highlights these sentences within their context before they are provided to the LLM, without truncating or altering the context in any other way. We show that this simple idea brings consistent improvements in experiments on three question answering tasks (ARC-Challenge, PubHealth, and PopQA) using five different LLMs. It notably enhances relative accuracy by up to 19.7% on PubHealth and 10% on ARC-Challenge compared to a conventional RAG system.
- Asia (0.14)
- North America > United States > Louisiana (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.14)
- Health & Medicine (1.00)
- Education (0.68)
- Government > Regional Government > North America Government > United States Government (0.67)
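The highlight-in-place idea from the abstract above can be sketched as follows. This is only an illustration under loud assumptions: word overlap stands in for whatever relevance scoring the authors use, the sentence splitter is naive, and the function names and the `**` marker are hypothetical. The point it shows is that selected sentences are marked inside the retrieved passage while the rest of the context is left untouched.

```python
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (an assumption for illustration).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(text):
    # Lowercased alphanumeric word set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def highlight_relevant(question, passage, top_k=1, marker="**"):
    """Wrap the top_k sentences that overlap most with the question in `marker`.

    Word overlap is a stand-in relevance score, not the paper's method.
    Sentences that are not selected are returned unchanged, so no context is dropped.
    """
    q = tokens(question)
    sentences = split_sentences(passage)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: len(q & tokens(sentences[i])),
        reverse=True,
    )
    # Keep only sentences that share at least one word with the question.
    chosen = {i for i in ranked[:top_k] if q & tokens(sentences[i])}
    return " ".join(
        f"{marker}{s}{marker}" if i in chosen else s
        for i, s in enumerate(sentences)
    )


passage = "Paris is in France. The Eiffel Tower was completed in 1889. It attracts many visitors."
question = "When was the Eiffel Tower completed?"
highlighted = highlight_relevant(question, passage)
print(highlighted)
```

Stripping the markers from the output recovers the original passage, which mirrors the abstract's point that the context is neither truncated nor otherwise altered.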
SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Liu, Zhining, Amjad, Rana Ali, Adkathimar, Ravinarayana, Wei, Tianxin, Tong, Hanghang
Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide factually correct grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information - an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and factually grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
- Oceania > Australia > South Australia (0.15)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- (29 more...)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Media > Music (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)