qampari
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
Song, Maojia, Sim, Shang Hong, Bhardwaj, Rishabh, Chieu, Hai Leong, Majumder, Navonil, Poria, Soujanya
LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the appropriateness of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable sizes on ASQA (up 10.7), QAMPARI (up 29.2) and ELI5 (up 14.9). We release our code at: https://github.com/declare-lab/trust-align.
- South America > Colombia (0.04)
- Europe > United Kingdom > England (0.04)
- Asia > Singapore (0.04)
- (6 more...)
- Leisure & Entertainment (1.00)
- Media (0.67)
- Health & Medicine (0.67)
Atomic Self-Consistency for Better Long Form Generations
Thirukovalluru, Raghuveer, Huang, Yukun, Dhingra, Bhuwan
Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question. In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response. ASC follows recent work, Universal Self-Consistency (USC) in using multiple stochastic samples from an LLM to improve the long-form response. Unlike USC which only focuses on selecting the best single generation, ASC picks authentic subparts from the samples and merges them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoids and open-ended QA datasets - ASQA, QAMPARI, QUEST, ELI5 with ChatGPT and Llama2. Our analysis also reveals untapped potential for enhancing long-form generations using approach of merging multiple samples.
Calibrating Long-form Generations from Large Language Models
Huang, Yukun, Liu, Yixin, Thirukovalluru, Raghuveer, Cohan, Arman, Dhingra, Bhuwan
To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
- Asia > Singapore (0.04)
- Asia > Philippines > Luzon > Central Luzon > Province of Tarlac > City of Tarlac (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (7 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
- Media > Music (0.67)
- Leisure & Entertainment (0.67)
Training Language Models to Generate Text with Citations via Fine-grained Rewards
Huang, Chengyu, Wu, Zeqiu, Hu, Yushi, Wang, Wenya
While recent Large Language Models (LLMs) have proven useful in answering user queries, they are prone to hallucination, and their responses often lack credibility due to missing references to reliable sources. An intuitive solution to these issues would be to include in-text citations referring to external documents as evidence. While previous works have directly prompted LLMs to generate in-text citations, their performances are far from satisfactory, especially when it comes to smaller LLMs. In this work, we propose an effective training framework using fine-grained rewards to teach LLMs to generate highly supportive and relevant citations, while ensuring the correctness of their responses. We also conduct a systematic analysis of applying these fine-grained rewards to common LLM training strategies, demonstrating its advantage over conventional practices. We conduct extensive experiments on Question Answering (QA) datasets taken from the ALCE benchmark and validate the model's generalizability using EXPERTQA. On LLaMA-2-7B, the incorporation of fine-grained rewards achieves the best performance among the baselines, even surpassing that of GPT-3.5-turbo.
- Europe (0.14)
- South America > Colombia (0.04)
- Oceania > Australia (0.04)
- (5 more...)
- Media > Music (1.00)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- (2 more...)
Crafting In-context Examples according to LMs' Parametric Knowledge
Lee, Yoonsang, Atreya, Pranav, Ye, Xi, Choi, Eunsol
In-context learning has been applied to knowledge-rich tasks such as question answering. In such scenarios, in-context examples are used to trigger a behaviour in the language model: namely, it should surface information stored in its parametric knowledge. We study the construction of in-context example sets, with a focus on the parametric knowledge of the model regarding in-context examples. We identify 'known' examples, where models can correctly answer from its parametric knowledge, and 'unknown' ones. Our experiments show that prompting with 'unknown' examples decreases the performance, potentially as it encourages hallucination rather than searching its parametric knowledge. Constructing an in-context example set that presents both known and unknown information performs the best across diverse settings. We perform analysis on three multi-answer question answering datasets, which allows us to further study answer set ordering strategies based on the LM's knowledge about each answer. Together, our study sheds lights on how to best construct in-context example sets for knowledge-rich tasks.
- Africa > South Africa (0.14)
- Asia > India (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (7 more...)
- Media (0.93)
- Government > Regional Government (0.93)
- Education (0.68)
- Leisure & Entertainment > Sports > Soccer (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.75)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.68)