factuality
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Singapore (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Public Health (0.97)
- Information Technology (0.93)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Illinois (0.04)
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning
However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Louisiana (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better.
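One way to write down the idea in the abstract above: score a generated text $y = (y_1, \dots, y_m)$ by the weighted log-likelihood that a pre-trained sequence-to-sequence model with parameters $\theta$ assigns to it, given a conditioning text $x$ (the source or a reference). The symbols $x$, $y$, $\omega_t$, and $\theta$ are notation introduced here for illustration; the abstract itself does not define them.

```latex
\[
\mathrm{Score}(y \mid x) \;=\; \sum_{t=1}^{m} \omega_t \, \log p\left(y_t \mid y_{<t},\, x;\, \theta\right)
\]
```

With uniform weights $\omega_t = 1$ this reduces to the log-likelihood of $y$ given $x$. Swapping which text plays the role of $x$ and which plays $y$ gives the "to/from" directions the abstract mentions.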
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- Asia > China (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology (0.46)
- Education (0.46)
1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_and_Benchmarks_Track.pdf
There are two updating strategies: 1) a mimicking strategy to generate similar samples based on original data, preserving stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results.
- Asia > China (0.04)
- North America > United States > Colorado > Weld County > Evans (0.04)
- North America > Canada > Ontario > Toronto (0.04)
BARTScore: Evaluating Generated Text as Text Generation
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric BARTScore with a number of variants that can be flexibly applied in an unsupervised fashion to evaluation of text from different perspectives (e.g.
Graph-based Uncertainty Metrics for Long-form Language Model Generations
Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
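The bipartite view described above can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the `support` mapping (which sampled response supports which claim) is assumed to be given, all function names are hypothetical, and the closeness score uses the common component-scaled (Wasserman-Faust) definition as one reasonable choice.

```python
from collections import deque


def build_bipartite_graph(support):
    """Build an adjacency map from {response_id: set of supported claim_ids}.

    Nodes are tagged tuples so response ids and claim ids cannot collide.
    """
    adjacency = {}
    for response, claims in support.items():
        r_node = ("response", response)
        adjacency.setdefault(r_node, set())
        for claim in claims:
            c_node = ("claim", claim)
            adjacency[r_node].add(c_node)
            adjacency.setdefault(c_node, set()).add(r_node)
    return adjacency


def degree_confidence(adjacency, num_responses):
    """Self-consistency style score: fraction of responses supporting a claim."""
    return {
        node[1]: len(neighbors) / num_responses
        for node, neighbors in adjacency.items()
        if node[0] == "claim"
    }


def closeness_confidence(adjacency):
    """Closeness centrality of each claim node, scaled by component size."""
    n = len(adjacency)
    scores = {}
    for source in adjacency:
        if source[0] != "claim":
            continue
        # Breadth-first search gives shortest-path distances from this claim.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        reachable = len(dist) - 1
        total = sum(dist.values())
        if total == 0 or n <= 1:
            scores[source[1]] = 0.0
        else:
            scores[source[1]] = (reachable / (n - 1)) * (reachable / total)
    return scores


# Hypothetical toy data: four sampled responses and the claims each supports.
support = {
    "r1": {"a", "b"},
    "r2": {"a"},
    "r3": {"a"},
    "r4": {"c"},
}
graph = build_bipartite_graph(support)
degree = degree_confidence(graph, num_responses=len(support))
closeness = closeness_confidence(graph)
```

In this toy example, claims `b` and `c` are each supported by a single response, so degree centrality cannot separate them. Closeness ranks `b` higher because it co-occurs with the widely supported claim `a`, whereas `c` appears only in an isolated response.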
EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants
Mittal, Meenakshi, Khare, Rishi, Miroyan, Mihran, Mitra, Chancharik, Norouzi, Narges
With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment.
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Information Technology (0.93)
- Education > Curriculum (0.68)
- Education > Educational Setting > Higher Education (0.46)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)