Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Nov-3-2025–arXiv.org Artificial Intelligence

Hong Jiao, University of Maryland, College Park
Hanna Choi, University of Maryland, College Park
Haowei Hua, Princeton University

Abstract

This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring, using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. In general, essay-based scoring performed better than rationale-based scoring, with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0, which was underrepresented because of class imbalance. Ensemble modeling of the essay-based scoring models increased scoring accuracy both at specific score levels and across all score levels. Ensembles of essay-based scoring with each of the rationale-based scoring models performed about the same. A further ensemble of essay-based scoring and both rationale-based scoring models yielded the best scoring accuracy, with a QWK of 0.870 compared with 0.848 reported in the literature.

Introduction

Automated essay scoring methodology has developed along with advances in AI technology. From early supervised machine learning models based on engineered features (e.g., Mahana et al., 2012) to the recent use of large language models (LLMs), the methods for automated essay scoring, as demonstrated in Appendix A, have evolved with advances in machine learning, deep learning, language models, and LLMs. Using automated scoring of Prompt 6 in the Automated Student Assessment Prize (ASAP) dataset from Kaggle, this study explores the utility of rationales generated by LLMs in enhancing automated essay scoring. For ASAP Prompt 6, automated scoring models have been developed since the 2012 Kaggle competition.
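The abstract reports agreement as Quadratic Weighted Kappa (QWK) and per-score-level accuracy as F1. As a reading aid, the following is a minimal sketch of how these two metrics are conventionally computed. It is not the authors' code: the function names `quadratic_weighted_kappa` and `f1_for_score` are hypothetical, the score range is passed in as a parameter, and the scores in the usage lines are made up for illustration.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    min_score and max_score define the rating scale (assumed inputs,
    not values taken from the paper).
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score levels")

    k = max_score - min_score + 1
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal counts for each rater.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            expected = hist_human[i] * hist_model[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave one identical constant score: treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def f1_for_score(human, model, score):
    """F1 for one score level, treating that level as the positive class."""
    tp = sum(1 for h, m in zip(human, model) if h == score and m == score)
    fp = sum(1 for h, m in zip(human, model) if h != score and m == score)
    fn = sum(1 for h, m in zip(human, model) if h == score and m != score)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up scores for illustration only.
human_scores = [0, 0, 1, 2, 3, 4]
model_scores = [0, 1, 1, 2, 3, 3]
qwk = quadratic_weighted_kappa(human_scores, model_scores, 0, 4)
f1_zero = f1_for_score(human_scores, model_scores, 0)
```

Because QWK penalizes disagreements by the squared distance between scores, a model can have a high QWK while still missing a rare score level, which is why a per-level F1 such as the one for score 0 can tell a different story.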

large language model, machine learning, natural language


Country:
- North America > United States > Maryland > Prince George's County > College Park (0.44)

Genre:
- Research Report
  - New Finding (0.66)
  - Experimental Study (0.46)

Industry:
- Education
  - Assessment & Standards > Student Performance (1.00)
  - Educational Technology > Educational Software
    - Computer-Aided Assessment (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
