Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Lee, Donggyu, Park, Sungwon, Hwang, Yerin, Kim, Hyoshin, Oh, Hyunwoo, Kim, Jungwon, Cha, Meeyoung, Park, Sangyoon, Kim, Jihee

Oct-10-2025–arXiv.org Artificial Intelligence

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.

large language model, machine learning, natural language

Country:
- Asia (0.46)
- North America > United States (0.28)
- Europe (0.28)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Law (0.67)
- Banking & Finance > Economy (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.51)
