Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

Ming Ding, Rasmus Kyng, Federico Solda, Weixuan Yuan

arXiv.org Artificial Intelligence 

While LLMs have demonstrated impressive capabilities, their true level of intelligence and reasoning remains a subject of debate. The classical Turing Test proposes that a machine producing human-like responses in conversation could be considered intelligent. Over the past few years, substantial effort has been devoted to evaluating LLMs from various angles [Cha+24]. For example, LLMs can generate essays whose quality is rated higher than those produced by humans [Her+23]; correctly answer questions involving communication skills, ethics, empathy, and professionalism on the United States Medical Licensing Examination (USMLE) [Bri+23]; achieve passing scores on the reading comprehension test of the Program for International Student Assessment (PISA), a global standardized student assessment [Vaz+23]; and perform strongly on middle school-level math word problems, with multiple LLMs achieving passing scores and some exceeding 90% accuracy [Vid24]. However, existing evaluation protocols may fall short of comprehensively assessing LLMs' reasoning and problem-solving capabilities.