Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading
arXiv.org Artificial Intelligence
Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses despite the limited availability of human graders. Over the years, carefully trained models have achieved steadily higher performance. More recently, pre-trained Large Language Models (LLMs) have emerged as a commodity, raising the question of how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard 2-way and 3-way benchmark datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to that of hand-engineered models, but worse than that of pre-trained LLMs with specialized training.
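To make the task setup concrete, the grading conditions described above (judging a student answer against a reference answer, or withholding the reference so the model must rely on its own subject knowledge) could be sketched as follows. This is an illustrative assumption, not the paper's actual prompt or pipeline; the function name, label sets, and prompt wording are invented for the sketch, and the 2-way/3-way labels follow the SciEntsBank-style correct/incorrect and correct/contradictory/incorrect schemes.

```python
# Illustrative sketch only: the exact prompts and label names used in the
# paper are not reproduced here; these are assumptions for demonstration.

LABELS_2WAY = ("correct", "incorrect")
LABELS_3WAY = ("correct", "contradictory", "incorrect")

def build_asag_prompt(question, student_answer,
                      reference_answer=None, labels=LABELS_2WAY):
    """Compose a short-answer grading prompt for a general-purpose LLM.

    Passing reference_answer=None models the 'withheld reference'
    condition, where the model must grade from its own knowledge.
    """
    lines = [
        "You are grading a short student answer.",
        f"Question: {question}",
    ]
    if reference_answer is not None:
        lines.append(f"Reference answer: {reference_answer}")
    lines.append(f"Student answer: {student_answer}")
    lines.append("Reply with exactly one label: " + ", ".join(labels) + ".")
    return "\n".join(lines)

# Example: 3-way grading with the reference answer provided.
prompt = build_asag_prompt(
    question="Why does a metal spoon feel colder than a wooden one?",
    student_answer="Because metal conducts heat away from the hand faster.",
    reference_answer="Metal has higher thermal conductivity than wood.",
    labels=LABELS_3WAY,
)
print(prompt)
```

The prompt string would then be sent to the LLM (e.g., via a chat-completion API) and the returned label compared against the benchmark's gold annotation.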
Sep-17-2023