DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

Seo, Jean, Lim, Jongwon, Jang, Dongjun, Shin, Hyopil

Nov-14-2024–arXiv.org Artificial Intelligence

We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, being able to be expanded to other specialized domains. We release the dataset and code in public.

atomic unit, dataset, hallucination, (15 more...)

arXiv.org Artificial Intelligence

Nov-14-2024

arXiv.org PDF

Add feedback

Country:
- Asia
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Middle East
    - Jordan (0.04)
    - Saudi Arabia > Asir Province
      - Abha (0.04)
  - South Korea > Seoul
    - Seoul (0.04)

Genre:
- Research Report (0.64)

Industry:
- Health & Medicine > Therapeutic Area (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.31)
  - Natural Language
    - Chatbot (0.96)
    - Large Language Model (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found