Measuring short-form factuality in large language models
Wei, Jason; Karina, Nguyen; Chung, Hyung Won; Jiao, Yunxin Joy; Papay, Spencer; Glaese, Amelia; Schulman, John; Fedus, William
arXiv.org Artificial Intelligence
An open problem in artificial intelligence is how to train language models that produce responses that are factually correct. Current frontier models sometimes produce false outputs or answers that are not substantiated by evidence, a problem known as "hallucinations." Such hallucinations are one of the major barriers to broader adoption of general forms of artificial intelligence, such as large language models. Factuality is a complicated topic because it is hard to measure: evaluating the factuality of any arbitrary claim can be challenging, and language models often generate long completions that contain dozens of factual claims. In this work we sidestep the open-endedness of language models by considering only short, fact-seeking questions with a single answer. This reduction of scope is important because it makes measuring factuality much more tractable, albeit at the cost of leaving open research questions such as whether improved behavior on short-form factuality generalizes to long-form factuality. We present a benchmark called SimpleQA, which contains 4,326 short, fact-seeking questions. SimpleQA was designed with a few important properties in mind: High correctness. Reference answers to questions are determined by two independent AI trainers, and questions were written in such a way that the predicted answers are easily gradable.
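To make the evaluation setup concrete, the sketch below shows how a model's short-form answer might be graded against a single reference answer. This is a minimal illustration, not the authors' implementation: the field names ("problem", "answer") and the normalized string match are assumptions for illustration only, and the paper itself uses a model-based grader rather than string comparison.

```python
# Minimal sketch of grading a short, fact-seeking answer against a
# SimpleQA-style reference answer. Field names and the matching rule
# are illustrative assumptions, not the benchmark's actual grader.

import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a loose match."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def grade(predicted: str, reference: str) -> str:
    """Return 'correct', 'incorrect', or 'not_attempted' for one question."""
    if not predicted.strip():
        return "not_attempted"
    return "correct" if normalize(reference) in normalize(predicted) else "incorrect"

# Example: one short question with a single, easily gradable reference answer.
question = {"problem": "In which year was the Eiffel Tower completed?", "answer": "1889"}
prediction = "The Eiffel Tower was completed in 1889."
print(grade(prediction, question["answer"]))  # -> correct
```

Because each question has a single short answer, grading reduces to comparing one predicted string against one reference, which is what makes the benchmark tractable to score at scale.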
Nov-6-2024
- Country:
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Leisure & Entertainment (0.94)