MultiHoax: A Dataset of Multi-hop False-Premise Questions

Shafiei, Mohammadamin, Saffari, Hamidreza, Moosavi, Nafise Sadat

Jun-5-2025–arXiv.org Artificial Intelligence

As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.

false premise, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

Jun-5-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia > Middle East
  - Iran (0.28)

Genre:
- Research Report (1.00)
- Personal > Honors (1.00)

Industry:
- Education (0.93)
- Media > News (0.68)
- Leisure & Entertainment > Sports
  - Olympic Games (1.00)
  - Soccer (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Cognitive Science > Problem Solving (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found