FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Qin, Bowen; Yue, Chen; Yin, Fang; Wang, Hui; Yao, JG; Liu, Jiakang; Zheng, Jing-Shu; Chen, Miguel Hu; Xuan, Richeng; Meng, Shibei; Zhou, Shiqi; Dai, Teng; Ren, Tong-Shuai; Cui, Wei; Yang, Xi; Du, Xialin; Xu, Xiaojing; Sun, Xue; Li, Xuejing; Liu, Yaming; Liu, Yesheng; Liu, Ying; Lin, Yonghua; Zhao, Yu; Zhang, Yunduo; Luo, Yuwen; He, Zheqi; He, Zhiyuan; Wang, Zhongyuan

Nov-26-2025, arXiv.org Artificial Intelligence

We conduct a moderate-scale and, to some extent, contamination-free evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision language models, which is intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available at: https://flageval-baai.github.io/LRM-Eval/

large language model, machine learning, natural language


Country:
- Europe (1.00)
- North America > United States (0.67)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Information Technology (1.00)
- Health & Medicine (0.93)
- Media > Television (0.67)
- Leisure & Entertainment
  - Games (0.67)
  - Sports > Soccer (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
