Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci

arXiv.org Artificial Intelligence 

The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. The closed-source GPT models outperform the open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving a Spearman correlation of $r = .74$ with human assessments on the overall score and an internal consistency of $ICC = .80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, because the models tend to assign higher scores than teachers, they require further refinement to better capture aspects of content quality.
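As a minimal sketch (not the authors' code), the two metrics reported in the abstract can be computed as follows: Spearman's rank correlation for alignment between model and teacher scores, and an intraclass correlation coefficient (ICC) for internal consistency across repeated model runs. All scores, run counts, and the use of the pingouin library below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import pingouin as pg  # assumed dependency for ICC computation

rng = np.random.default_rng(0)
n_essays = 20  # matches the corpus size mentioned above; scores are placeholders

# Hypothetical overall scores: mean teacher rating per essay and one LLM's rating.
teacher_scores = rng.uniform(1, 6, n_essays)
llm_scores = teacher_scores + rng.normal(0, 0.8, n_essays)

# Alignment with human ratings: Spearman rank correlation.
rho, p = spearmanr(teacher_scores, llm_scores)
print(f"Spearman's r = {rho:.2f} (p = {p:.3f})")

# Internal consistency: score the same essays repeatedly (three hypothetical runs
# of the same model) and compute an ICC treating the runs as raters.
runs = pd.DataFrame({
    "essay": np.tile(np.arange(n_essays), 3),
    "run": np.repeat(["run1", "run2", "run3"], n_essays),
    "score": np.concatenate(
        [llm_scores + rng.normal(0, 0.3, n_essays) for _ in range(3)]
    ),
})
icc = pg.intraclass_corr(data=runs, targets="essay", raters="run", ratings="score")
print(icc[["Type", "ICC"]])
```

The exact ICC variant (e.g., single vs. average measures) depends on the study design; the pingouin output lists all standard forms so the appropriate one can be selected.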
