Long-form factuality in large language models

May-31-2025, 11:33:37 GMT–Neural Information Processing Systems

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method we call Search-Augmented Factuality Evaluator (SAFE). SAFE uses an LLM to break a long-form response down into a set of individual facts and to evaluate the accuracy of each fact through a multi-step reasoning process: it sends search queries to Google Search and determines whether the fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality.
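The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `evaluate_response`, `split_into_facts`, `search`, `is_supported`, and `f1_at_k` are hypothetical, the LLM and Google Search steps are passed in as plain callables, and each fact is checked with a single query rather than a multi-step process. The abstract does not say how F1 is extended; the sketch assumes that recall is measured against a chosen target number of supported facts `k`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    supported: bool


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],   # hypothetical: LLM call in practice
    search: Callable[[str], List[str]],             # hypothetical: search-engine call in practice
    is_supported: Callable[[str, List[str]], bool], # hypothetical: LLM judgment in practice
) -> List[FactVerdict]:
    """Split a response into individual facts and rate each against search results."""
    verdicts = []
    for fact in split_into_facts(response):
        results = search(fact)  # simplification: one query per fact
        verdicts.append(FactVerdict(fact, is_supported(fact, results)))
    return verdicts


def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Aggregate per-fact verdicts into one score.

    Assumption (not stated in the abstract): precision is the fraction of
    rated facts that are supported, and recall is the number of supported
    facts relative to a target count k, capped at 1.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)
```

With toy stand-ins for the three callables (for example, splitting on periods and looking facts up in a small in-memory list), `evaluate_response` returns one verdict per fact, and the supported and unsupported counts feed directly into `f1_at_k`. Passing the LLM and search steps as parameters keeps the sketch runnable without committing to any particular model or search API.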

large language model, machine learning, natural language

Country:
- Africa (0.92)
- Asia > Japan
  - Honshū > Kantō > Kanagawa Prefecture (0.27)
- Europe > Germany (0.67)
- North America > United States
  - California (0.92)
- Oceania > Australia (1.00)

Genre:
- Overview (0.67)
- Personal (1.00)
- Research Report > Experimental Study (1.00)

Industry:
- Media
  - Film (1.00)
  - Music (1.00)
  - Television (1.00)
- Banking & Finance > Economy (1.00)
- Government
  - Foreign Policy (0.67)
  - Immigration & Customs (0.68)
  - Military (1.00)
  - Regional Government
    - Europe Government (0.92)
    - North America Government > United States Government (1.00)
- Health & Medicine
  - Health Care Providers & Services (0.67)
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Immunology (1.00)
    - Infections and Infectious Diseases (1.00)
- Law
  - Civil Rights & Constitutional Law (1.00)
  - Criminal Law (0.67)
  - Environmental Law (0.67)
- Information Technology
  - Security & Privacy (1.00)
  - Services (0.89)
- Education > Educational Setting (0.67)
- Leisure & Entertainment
  - Games (0.67)
  - Sports
    - Baseball (1.00)
    - Golf (0.92)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)