What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles
Mengtao Zhou, Sifan Wu, Huan Zhang, Qi Sima, Bang Liu
arXiv.org Artificial Intelligence
We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning: the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic "Turtle Soup" game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup puzzles sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs' performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency, detail completion, and conclusion alignment. Experiments with leading LLMs reveal clear capability limits, common failure patterns, and a significant performance gap compared to humans. Our work offers new insights into LLMs' imaginative reasoning and establishes a foundation for future research on exploratory agent behavior.
Aug-15-2025