FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
–arXiv.org Artificial Intelligence
We conduct a moderate-scale, contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available at: https://flageval-baai.github.io/LRM-Eval/
Nov-26-2025