Goto

Collaborating Authors

 Stearns County


The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs

arXiv.org Artificial Intelligence

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation, however, fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development. In recent years, large language models (LLMs) have become essential across various applications, helping users complete tasks in diverse domains, thanks to their remarkable abilities in understanding, analyzing, and generating text (Shen et al., 2023a; Yu et al., 2023). However, LLMs are not without their problems and risks. Many of these issues, such as bias (Talat et al., 2022; Motoki et al., 2023), hallucination (Chen et al., 2023; Sadat et al., 2023), consistency (Tam et al., 2023; Ye et al., 2023), and reliability (Shen et al., 2023b) have been extensively discussed in the literature. However, a more fundamental challenge to the long-term success of LLMs is their ability to reason: the distinguishing factor between probabilistic pattern matching and logical understanding. This distinction has significant implications for the future of LLMs and how we employ these models in decision-making. One necessary requirement for reasoning is order independence.


Learning Answer Generation using Supervision from Automatic Question Answering Evaluators

arXiv.org Artificial Intelligence

Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (a la retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting training data with answers generated by the GenQA model and labelled by GAVA (either statically, before training, or (ii) dynamically, at every training epoch); and (iii) using the GAVA score for weighting the generator loss during the learning of the GenQA model. We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.