SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Xiang, Kun, Li, Heng, Zhang, Terry Jingchen, Huang, Yinya, Liu, Zirong, Qu, Peixin, He, Jixi, Chen, Jiaqi, Yuan, Yu-Jie, Han, Jianhua, Xu, Hang, Li, Hanhui, Sachan, Mrinmaya, Liang, Xiaodan

Oct-7-2025, arXiv.org Artificial Intelligence

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school level to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline and incorporates 21 categories of highly heterogeneous diagrams. In contrast to prior work, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that cannot be solved correctly without extracting information from the diagram. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in the visual understanding capabilities of current large language models, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.

large language model, machine learning, natural language

Country:
- Asia (0.93)
- North America > United States (0.28)
- Europe > Austria (0.28)

Genre:
- Research Report > New Finding (0.67)
- Instructional Material > Course Syllabus & Notes (0.46)

Industry:
- Education > Educational Setting
  - Higher Education (0.46)
  - K-12 Education (0.35)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)