True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4
arXiv.org Artificial Intelligence
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on current test tasks. This calls for more challenging benchmarks that require highly advanced reasoning ability to solve. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over an 80% success rate. We show that GPT-3 models barely outperform random guessing on this benchmark (with 28% accuracy), while state-of-the-art GPT-4 solves only 38% of the puzzles. This indicates that there is still a significant gap between the deep reasoning abilities of LLMs and those of humans, and it highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.
Jun-1-2023