Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

Ming Ding, Rasmus Kyng, Federico Solda, Weixuan Yuan

arXiv.org Artificial Intelligence 

While LLMs have demonstrated impressive capabilities, their true level of intelligence and reasoning remains a subject of debate. The classical Turing Test proposes that a machine producing human-like responses in conversation could be considered intelligent. Over the past few years, substantial effort has been devoted to evaluating LLMs from various angles [Cha+24]. For example, LLMs can generate essays whose quality is rated higher than those produced by humans [Her+23]; correctly answer questions involving communication skills, ethics, empathy, and professionalism on the United States Medical Licensing Examination (USMLE) [Bri+23]; achieve passing scores on the reading comprehension test of the Program for International Student Assessment (PISA), a global standardized student assessment [Vaz+23]; and perform strongly on middle school-level math word problems, with multiple LLMs achieving passing scores and some exceeding 90% accuracy [Vid24]. However, existing evaluation protocols may fall short of comprehensively assessing LLMs' reasoning and problem-solving capabilities.