Do Language Models Know When They're Hallucinating References?
Ayush Agrawal, Mirac Suzgun, Lester Mackey, Adam Tauman Kalai
State-of-the-art language models (LMs) are famous for "hallucinating" references. These fabricated article and book titles cause harm, create obstacles to the models' use, and provoke public backlash. While other types of LM hallucination are also important, we propose hallucinated references as the "drosophila" of research on hallucination in large language models (LLMs), as they are particularly easy to study. We show that simple search-engine queries reliably identify such hallucinations, which facilitates evaluation. To begin to dissect the nature of hallucinated LM references, we attempt to classify them using black-box queries to the same LM, without consulting any external resources. Consistency checks using "direct" queries about whether the generated reference title is real (inspired by Kadavath et al. 2022, Lin et al. 2022, Manakul et al. 2023) are compared to consistency checks using "indirect" queries, which ask for ancillary details such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not a reference is a hallucination. In particular, we find that LMs often produce differing authors for hallucinated references when queried in independent sessions, while they consistently identify the authors of real references. This suggests that hallucination may be more of a generation issue than something inherent to current training techniques or representations.
arXiv.org Artificial Intelligence
Sep-13-2023
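
The indirect-query consistency check described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the query_lm placeholder, the prompt wording, the string-similarity measure, and the agreement threshold are all hypothetical choices made for exposition.

from difflib import SequenceMatcher
from itertools import combinations

def query_lm(prompt: str, session: int) -> str:
    # Hypothetical placeholder: send `prompt` to the language model in an
    # independent session and return its text completion.
    raise NotImplementedError("Replace with a call to your LM client.")

def author_agreement(answers: list[str]) -> float:
    # Mean pairwise string similarity between the author lists the LM returned.
    if len(answers) < 2:
        return 1.0
    pairs = list(combinations(answers, 2))
    return sum(
        SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in pairs
    ) / len(pairs)

def looks_hallucinated(title: str, sessions: int = 3, threshold: float = 0.5) -> bool:
    # Indirect query: ask for ancillary details (the authors) in several
    # independent sessions and flag the reference when the answers disagree;
    # real references tend to elicit consistent author lists.
    prompt = f'Who are the authors of the work titled "{title}"? List only the names.'
    answers = [query_lm(prompt, session=i) for i in range(sessions)]
    return author_agreement(answers) < threshold

In practice one would substitute a real LM client for query_lm and tune the threshold against references verified by search-engine lookup, as the paper uses search queries for ground-truth evaluation.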