Ishii, Etsuko
LLM Internal States Reveal Hallucination Risk Faced With a Query
Ji, Ziwei, Chen, Delong, Ishii, Etsuko, Cahyawijaya, Samuel, Bang, Yejin, Wilie, Bryan, Fung, Pascale
The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadly, both in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks spanning over 700 datasets. Our empirical analysis reveals two key insights: (1) LLM internal states indicate whether the model has seen the query in training data or not; and (2) LLM internal states indicate whether the model is likely to hallucinate in response to the query. Our study explores the particular neurons, activation layers, and tokens that play a crucial role in the LLM perception of uncertainty and hallucination risk. Using a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
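The probing idea can be pictured with a small sketch. The following is a minimal illustration and not the authors' code: it takes the last-token hidden state of a query from an off-the-shelf causal LM and fits a simple classifier on toy hallucination labels. The model choice ("gpt2"), the layer and token selection, and the labelled queries are all assumptions.

```python
# Minimal sketch (not the authors' code): probe hidden states of a causal LM
# to predict a binary hallucination-risk label for each query.
# Assumptions: "gpt2" stands in for the LLM studied; the layer/token choices and
# the labelled (query, risk) pairs are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def query_features(query: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the last token at a chosen layer, used as the probe input."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # shape: (hidden_size,)

# Hypothetical labels: 1 = the model hallucinated on this query, 0 = it did not.
queries = ["Who wrote Hamlet?", "What did Ada Lovelace tweet in 2015?"]
labels = [0, 1]

X = torch.stack([query_features(q) for q in queries]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # per-query hallucination-risk prediction
```

In practice such a probe would be trained on many labelled queries and compared across layers and token positions; the two-example fit above only shows the data flow.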
Belief Revision: The Adaptability of Large Language Models Reasoning
Wilie, Bryan, Cahyawijaya, Samuel, Ishii, Etsuko, He, Junxian, Fung, Pascale
The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data, and individuals update their beliefs and understandings accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence. Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning (ΔR) framework. Belief-R features sequences of premises designed to simulate scenarios where additional information could necessitate revising prior conclusions drawn by LMs. We evaluate around 30 LMs across diverse prompting strategies and find that LMs generally struggle to appropriately revise their beliefs in response to new information. Further, models adept at updating often underperform in scenarios without necessary updates, highlighting a critical trade-off. These insights underscore the importance of improving LMs' adaptiveness to changing information, a step toward more reliable AI systems.
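As a rough illustration of the kind of check the ΔR setting asks for (this is not the Belief-R benchmark or its prompts), one can give a model a premise set, add a defeating premise, and compare the two conclusions. The model name, prompt wording, and answer matching below are placeholders.

```python
# Minimal sketch (not the Belief-R protocol): check whether an instruction-tuned
# LM revises its conclusion after a defeating premise is added.
# The model choice, prompts, and answer handling are hypothetical assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def conclude(premises):
    prompt = ("Premises:\n" + "\n".join(f"- {p}" for p in premises) +
              "\nQuestion: Does Tweety fly? Answer yes, no, or cannot tell.\nAnswer:")
    out = generator(prompt, max_new_tokens=10, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip().lower()

base = ["Tweety is a bird.", "Birds normally fly."]
updated = base + ["Tweety is a penguin."]

# A model with good belief revision should move from "yes" toward "no"/"cannot tell",
# while leaving its answer unchanged when no defeating premise is added.
print(conclude(base), "->", conclude(updated))
```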
Contrastive Learning for Inference in Dialogue
Ishii, Etsuko, Xu, Yan, Wilie, Bryan, Ji, Ziwei, Lovenia, Holy, Chung, Willy, Fung, Pascale
Inferences, especially those derived from inductive processes, are a crucial component of conversation, complementing the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, lags far behind that in deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap -- which distinguishes inductive and deductive reasoning (Johnson-Laird, 1988, 1993). Our analysis reveals that the disparity in information between dialogue contexts and desired inferences poses a significant challenge to the inductive inference process. To mitigate this information gap, we investigate a contrastive learning approach by feeding negative samples. Our experiments suggest that negative samples help models understand what is wrong and improve the quality of their generated inferences.
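To give a concrete picture of learning from negative samples, a small sketch might contrast a dialogue context with a plausible and an implausible inference. This is an illustrative contrastive objective, not the paper's exact formulation; the encoder, temperature, and example texts are assumptions.

```python
# Minimal sketch (not the paper's exact objective): an InfoNCE-style contrastive
# loss that pushes a dialogue context toward its gold inference and away from a
# negative one. The encoder choice and example texts are hypothetical placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

context = ["A: I forgot my umbrella. B: You can borrow mine."]
positive = ["B is willing to help A."]   # plausible inference
negative = ["B refuses to talk to A."]   # implausible inference (negative sample)

c, p, n = embed(context), embed(positive), embed(negative)
logits = torch.cat([F.cosine_similarity(c, p), F.cosine_similarity(c, n)]) / 0.07
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))  # gold candidate is index 0
loss.backward()  # gradients would be used to fine-tune the encoder
print(float(loss))
```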
Towards Mitigating Hallucination in Large Language Models via Self-Reflection
Ji, Ziwei, Yu, Tiezheng, Xu, Yan, Lee, Nayeon, Ishii, Etsuko, Fung, Pascale
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks, including question answering (QA). However, their practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. This issue becomes particularly critical in the medical domain due to the uncommon professional concepts and potential social risks involved. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on the identification and comprehension of common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. Consequently, we harness the interactivity and multitasking ability of LLMs to produce progressively more precise and accurate answers. Experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
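The interactive self-reflection idea can be sketched as a generate-check-revise loop. The code below is an illustration under assumptions (model choice, prompt wording, and stopping rule are all placeholders), not the authors' pipeline.

```python
# Minimal sketch (an illustration, not the authors' pipeline): alternate knowledge
# acquisition, answer generation, and self-checking until the model judges its
# answer consistent with the acquired knowledge.
from transformers import pipeline

llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

def ask(prompt: str) -> str:
    out = llm(prompt, max_new_tokens=128, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()

def self_reflective_answer(question: str, max_rounds: int = 3) -> str:
    knowledge = ask(f"List key background facts needed to answer: {question}\nFacts:")
    answer = ask(f"Facts: {knowledge}\nQuestion: {question}\nAnswer:")
    for _ in range(max_rounds):
        verdict = ask(
            f"Facts: {knowledge}\nAnswer: {answer}\n"
            "Is the answer fully supported by the facts? Reply SUPPORTED or UNSUPPORTED:"
        )
        if "UNSUPPORTED" not in verdict.upper():
            break  # the model considers its answer grounded; stop refining
        answer = ask(
            f"Facts: {knowledge}\nPrevious answer: {answer}\n"
            f"Revise the answer so it is consistent with the facts.\nQuestion: {question}\nAnswer:"
        )
    return answer

print(self_reflective_answer("What are common side effects of ibuprofen?"))
```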
Survey of Hallucination in Natural Language Generation
Ji, Ziwei, Lee, Nayeon, Frieske, Rita, Yu, Tiezheng, Su, Dan, Xu, Yan, Ishii, Etsuko, Bang, Yejin, Dai, Wenliang, Madotto, Andrea, Fung, Pascale
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, and in turn to improvements in downstream tasks such as abstractive summarization, dialogue generation, and data-to-text generation. However, it is also apparent that deep learning-based generation is prone to hallucinate unintended text, which degrades system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented on measuring and mitigating hallucinated text, but these have never been reviewed in a comprehensive manner before. In this survey, we therefore provide a broad overview of the research progress and challenges of the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucination in the following downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated text in NLG.
VScript: Controllable Script Generation with Visual Presentation
Ji, Ziwei, Xu, Yan, Cheng, I-Tsun, Cahyawijaya, Samuel, Frieske, Rita, Ishii, Etsuko, Zeng, Min, Madotto, Andrea, Fung, Pascale
In order to offer a customized script tool and inspire professional scriptwriters, we present VScript. It is a controllable pipeline that generates complete scripts, including dialogues and scene descriptions, and presents them visually using video retrieval. With an interactive interface, our system allows users to select genres and input starting words that control the theme and development of the generated script. We adopt a hierarchical structure, which first generates the plot, then the script and its visual presentation. A novel approach is also introduced to plot-guided dialogue generation by treating it as an inverse dialogue summarization. The experiment results show that our approach outperforms the baselines on both automatic and human evaluations, especially in genre control.
Greenformer: Factorization Toolkit for Efficient Deep Neural Networks
Cahyawijaya, Samuel, Winata, Genta Indra, Lovenia, Holy, Wilie, Bryan, Dai, Wenliang, Ishii, Etsuko, Fung, Pascale
While recent advances in deep neural networks (DNNs) bring remarkable success, the computational cost also increases considerably. In this paper, we introduce Greenformer, a toolkit to accelerate the computation of neural networks through matrix factorization while maintaining performance. Greenformer can be easily applied with a single line of code to any DNN model. Our experimental results show that Greenformer is effective for a wide range of scenarios.
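As a rough sketch of the kind of matrix factorization such a toolkit automates (this is not the Greenformer API), one can replace a dense linear layer with a low-rank pair obtained via truncated SVD; the rank and layer below are arbitrary choices.

```python
# Minimal sketch (not the Greenformer toolkit itself): approximate a dense Linear
# layer with a low-rank factorization obtained via truncated SVD.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate W (out x in) as (out x rank) @ (rank x in) using truncated SVD."""
    W = layer.weight.data                                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()   # (rank, in)
    second.weight.data = U[:, :rank].contiguous()                         # (out, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

dense = nn.Linear(1024, 1024)
low_rank = factorize_linear(dense, rank=64)

x = torch.randn(2, 1024)
print((dense(x) - low_rank(x)).abs().max())  # approximation error depends on the chosen rank
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(dense), "->", params(low_rank))  # far fewer parameters after factorization
```

The trade-off is the usual one for low-rank methods: a smaller rank saves more compute and memory but approximates the original weights less closely.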
Assessing Political Prudence of Open-domain Chatbots
Bang, Yejin, Lee, Nayeon, Ishii, Etsuko, Madotto, Andrea, Fung, Pascale
Politically sensitive topics are still a challenge for open-domain chatbots. However, dealing with politically sensitive content in a responsible, non-partisan, and safe way is integral for these chatbots. Currently, the main approach to handling political sensitivity is to simply change the topic when it is detected. This is safe but evasive and results in a chatbot that is less engaging. In this work, as a first step towards a politically safe chatbot, we propose a group of metrics for assessing their political prudence. We then conduct a political prudence analysis of various chatbots and discuss their behavior from multiple angles through our automatic metric and human evaluation metrics. The testsets and codebase are released to facilitate future research.
CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System
Ishii, Etsuko, Xu, Yan, Winata, Genta Indra, Lin, Zhaojiang, Madotto, Andrea, Liu, Zihan, Xu, Peng, Fung, Pascale
Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based on users' needs, which remains challenging. To tackle this challenge, we utilize data augmentation methods and several training techniques with pre-trained language models to learn a general pattern of the task and thus achieve promising performance. In the DialDoc21 competition, our system achieved a 74.95 F1 score and a 60.74 Exact Match score in subtask 1, and a 37.72 SacreBLEU score in subtask 2. An empirical analysis is provided to explain the effectiveness of our approaches.
Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters
Xu, Yan, Ishii, Etsuko, Liu, Zihan, Winata, Genta Indra, Su, Dan, Madotto, Andrea, Fung, Pascale
To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent years. Despite the success of existing methods, they mainly follow the paradigm of retrieving relevant sentences from a large corpus and augmenting the dialogues with explicit extra information, which is time- and resource-consuming. In this paper, we propose KnowExpert, an end-to-end framework that bypasses the retrieval process by injecting prior knowledge into pre-trained language models with lightweight adapters. To the best of our knowledge, this is the first attempt to tackle this task relying solely on a generation-based approach. Experimental results show that KnowExpert performs comparably with the retrieval-based baselines, demonstrating the potential of our proposed direction.
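A lightweight adapter of the kind mentioned above can be sketched as a small residual bottleneck attached to a frozen pre-trained LM. This is not the KnowExpert implementation; the base model, adapter size, and insertion point below are hypothetical choices.

```python
# Minimal sketch (not the KnowExpert implementation): a bottleneck adapter placed
# after a transformer block's MLP in a frozen pre-trained LM, so that knowledge
# can be "injected" by training only the adapter weights.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        # Residual bottleneck: only these small matrices would be trained.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

model = GPT2LMHeadModel.from_pretrained("gpt2")   # placeholder base LM
for param in model.parameters():
    param.requires_grad = False                   # keep the pre-trained LM frozen

adapter = Adapter(model.config.n_embd)
block = model.transformer.h[-1].mlp               # wrap the MLP of the last block
original_forward = block.forward
block.forward = lambda x: adapter(original_forward(x))  # adapter on top of the MLP output

logits = model(torch.tensor([[50256]])).logits    # forward pass now goes through the adapter
print(logits.shape)
print("trainable adapter parameters:", sum(p.numel() for p in adapter.parameters()))
```

Because only the adapter parameters are trainable, swapping in differently trained adapters is cheap, which is the intuition behind storing knowledge in lightweight modules rather than retrieving it at inference time.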