Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading
Ming Ding, Rasmus Kyng, Federico Solda, Weixuan Yuan
arXiv.org Artificial Intelligence
While LLMs have demonstrated impressive capabilities, their true level of intelligence and reasoning remains a subject of debate. The classical Turing Test proposes that a machine demonstrating human-like responses in conversation could be considered intelligent. Over the past few years, substantial efforts have been devoted to evaluating LLMs from various angles [Cha+24]. For example, LLMs can generate essays rated higher in quality than those produced by humans [Her+23]; answer questions involving communication skills, ethics, empathy, and professionalism on the United States Medical Licensing Examination (USMLE) at a passing level [Bri+23]; achieve passing scores on the reading comprehension test of the Programme for International Student Assessment (PISA), a global standardized student assessment [Vaz+23]; and perform strongly on middle-school-level math word problems, with multiple LLMs achieving passing scores and some exceeding 90% accuracy [Vid24]. However, existing evaluation protocols may fall short of comprehensively assessing their reasoning and problem-solving capabilities.
May-21-2025