Long-form factuality in large language models
Neural Information Processing Systems
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality.
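As an illustration of the aggregation step, the sketch below computes an F1-style long-form factuality score from counts of supported and not-supported facts, with recall capped at a target number of facts K. The exact formula and the parameter K are assumptions inferred from the abstract's description of "extending F1 score", not quoted from it.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Hypothetical sketch of an F1-style long-form factuality score.

    Assumes precision = supported / (supported + not supported), and a
    recall that is capped once the response contains at least K supported
    facts, as one way of extending F1 to long-form responses.
    """
    total = num_supported + num_not_supported
    if total == 0 or num_supported == 0:
        return 0.0
    precision = num_supported / total
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Example: 42 supported facts, 8 not supported, target length K = 64.
print(f1_at_k(42, 8, 64))  # ≈ 0.74
```

In this reading, precision penalizes unsupported claims while the capped recall rewards responses that provide enough supported facts without rewarding unbounded length.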