WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Mar-27-2025, 07:21:39 GMT–Neural Information Processing Systems

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different retrieved passages, especially when these passages originate from the same source and are equally trustworthy. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions whose answers vary depending on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess how well LLMs, when augmented with retrieved passages containing real-world knowledge conflicts, present a complete perspective on the conflicts rather than choosing one answer over another. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages.
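
The QA scenarios named in the abstract (RAG with a single passage, and RAG with two contradictory passages) could be set up roughly as in the sketch below. This is only an illustration under stated assumptions: the `ConflictInstance` record, the `build_prompts` helper, the scenario keys, and the prompt wording are all hypothetical and are not taken from the WikiContradict paper or its released data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ConflictInstance:
    """Hypothetical record: one question plus two passages that disagree."""
    question: str
    passage_a: str
    passage_b: str


def build_prompts(instance: ConflictInstance) -> dict[str, str]:
    """Build one prompt per QA scenario (assumed wording, not the paper's)."""

    def render(passages: list[str]) -> str:
        # Number the passages so a model can refer to them individually.
        context = "\n\n".join(
            f"Passage {i}: {text}" for i, text in enumerate(passages, start=1)
        )
        return f"{context}\n\nQuestion: {instance.question}\nAnswer:"

    return {
        # RAG with a single passage: each passage is shown on its own.
        "rag_single_passage_a": render([instance.passage_a]),
        "rag_single_passage_b": render([instance.passage_b]),
        # RAG with two contradictory passages: both are shown together.
        "rag_two_contradictory_passages": render(
            [instance.passage_a, instance.passage_b]
        ),
    }
```

A usage example with an invented toy instance (not from the benchmark): `build_prompts(ConflictInstance("When did the bridge open?", "The bridge opened in 1931.", "The bridge opened in 1932."))` returns three prompts. In the two-passage prompt, an answer that reports both years and notes the disagreement would count as giving the complete perspective the abstract describes.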

information, large language model, machine learning

Country:
- Asia
  - Middle East > UAE (0.14)
  - Thailand (0.14)
- Europe > Middle East
  - Malta (0.14)
- North America > United States (0.14)

Genre:
- Research Report (1.00)

Industry:
- Government (1.00)
- Information Technology (0.93)
- Law (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)