Exposing Attention Glitches with Flip-Flop Language Modeling

Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

arXiv.org Artificial Intelligence 

Recent advancements in scale have yielded large language models (LLMs) with extraordinary proficiency in nuanced reasoning with factual knowledge. Despite these achievements, LLMs are known to produce incorrect outputs, often referred to colloquially as "hallucinations" or "distractions" (Ji et al., 2023). Generally, hallucinations refer to the phenomenon in which a model's outputs are syntactically and grammatically well-formed but factually incorrect. There are various types of hallucinations; the focus of this work is the "closed-domain" variety (Saparov and He, 2022; OpenAI, 2023), where the model's predictions contain factually incorrect or made-up information with respect to a given context, regardless of their correctness in the real world.

Perhaps surprisingly, such hallucinations can be observed even on simple algorithmic reasoning tasks. As a warmup, consider the queries shown in Figure 1 (and Appendix B.1), where we prompt LLMs to solve addition problems of various lengths. The responses simultaneously illustrate the following:

1. Nontrivial algorithmic generalization: In cases where the models succeed, it is unlikely that these exact numerical sequences appeared in the training data. To correctly output the first digit of the answer, the LLM must resolve a long dependency chain that generally depends on every digit of the input (see the sketch below). Somewhere within these networks' internal representations, implementations of addition algorithms have emerged.
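Because carries can propagate across an entire operand, the leading digit of a sum can hinge on the least significant digits of the inputs. The following minimal Python sketch (illustrative only; the specific numbers are not drawn from the paper or Figure 1) makes this dependency chain concrete:

    # Illustrative sketch: the leading digit of a sum depends on every input digit
    # because carries can propagate across the whole number.

    def leading_digit(n: int) -> int:
        """Return the most significant digit of a non-negative integer."""
        return int(str(n)[0])

    a, b = 4_999_999, 5_000_000
    print(a + b, leading_digit(a + b))   # 9999999 -> leading digit 9

    # Perturbing only the least significant digit of b changes every digit of the sum:
    b_perturbed = b + 1
    print(a + b_perturbed, leading_digit(a + b_perturbed))   # 10000000 -> leading digit 1

Flipping only the last digit of one operand changes every digit of the answer, including the first, so any procedure that outputs the answer's first digit correctly must aggregate information from the entire input.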
