Measuring short-form factuality in large language models
Wei, Jason; Karina, Nguyen; Chung, Hyung Won; Jiao, Yunxin Joy; Papay, Spencer; Glaese, Amelia; Schulman, John; Fedus, William
arXiv.org Artificial Intelligence
An open problem in artificial intelligence is how to train language models that produce responses that are factually correct. Current frontier models sometimes produce false outputs or answers that are not substantiated by evidence, a problem known as "hallucinations." Such hallucinations are one of the major barriers to broader adoption of general forms of artificial intelligence, such as large language models. Factuality is a complicated topic because it is hard to measure: evaluating the factuality of any arbitrary claim can be challenging, and language models often generate long completions that contain dozens of factual claims. In this work we sidestep the open-endedness of language models by considering only short, fact-seeking questions with a single answer. This reduction of scope is important because it makes measuring factuality much more tractable, albeit at the cost of leaving open research questions such as whether improved behavior on short-form factuality generalizes to long-form factuality. We present a benchmark called SimpleQA, which contains 4,326 short, fact-seeking questions. SimpleQA was designed with a few important properties in mind: High correctness. Reference answers to questions are determined by two independent AI trainers, and questions were written in such a way that the predicted answers are easily gradable.
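To make the evaluation setup concrete, the sketch below shows how a model's short-form answer might be graded against a single reference answer. This is a minimal illustration, not the authors' implementation: the field names ("problem", "answer") and the normalized string match are assumptions for illustration only, and the paper itself uses a model-based grader rather than string comparison.

```python
# Minimal sketch of grading a short, fact-seeking answer against a
# SimpleQA-style reference answer. Field names and the matching rule
# are illustrative assumptions, not the benchmark's actual grader.

import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a loose match."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def grade(predicted: str, reference: str) -> str:
    """Return 'correct', 'incorrect', or 'not_attempted' for one question."""
    if not predicted.strip():
        return "not_attempted"
    return "correct" if normalize(reference) in normalize(predicted) else "incorrect"

# Example: one short question with a single, easily gradable reference answer.
question = {"problem": "In which year was the Eiffel Tower completed?", "answer": "1889"}
prediction = "The Eiffel Tower was completed in 1889."
print(grade(prediction, question["answer"]))  # -> correct
```

Because each question has a single short answer, grading reduces to comparing one predicted string against one reference, which is what makes the benchmark tractable to score at scale.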
Nov-6-2024
- Country:
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Leisure & Entertainment (0.94)