Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

Song, Dan, Lee, Won-Chan, Jiao, Hong

arXiv.org Artificial Intelligence 

Using generalizability theory, this study evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. The essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporated both human and AI raters improved reliability, suggesting that hybrid scoring models may offer benefits for large-scale writing assessments.

Keywords: large language model; automated essay scoring; generalizability theory; writing assessment; AI-human comparison

The integration of large language models (LLMs) into automated essay scoring (AES) represents a significant shift in how essay scoring is approached. While traditional AES systems have long depended on manually engineered features and statistical models (Attali & Burstein, 2006; Dikli, 2006), LLMs offer the potential to assess student writing with greater flexibility and contextual sensitivity by drawing on deep learning architectures trained on diverse textual corpora (Ifenthaler, 2022; Ouyang et al., 2022). However, despite their promising capabilities, recent studies indicate that LLMs have not yet consistently matched the scoring reliability of established AES tools or trained human raters, especially in high-stakes language assessment contexts (Mizumoto & Eguchi, 2023; Xiao et al., 2025; Yancey et al., 2023).
These concerns highlight the need for rigorous evaluation of LLM-based scoring systems, particularly with respect to their reliability and alignment with human scoring standards. This study addresses these challenges by applying generalizability theory to systematically examine the consistency of LLM-generated scores on standardized writing tasks in the AP Chinese Language and Culture Exam (AP Chinese Exam).

Literature Review

This section reviews the literature on AES and the application of LLMs to AES. It also provides brief overviews of generalizability theory and the AP Chinese Language and Culture Exam, followed by the research questions addressed in this study.
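As background for the generalizability-theory framework used throughout the study, the following is a minimal numerical sketch of a one-facet (person x rater) G study. The essay scores below are hypothetical illustrations, not data from the AP Chinese analysis; the variance components are estimated from a standard two-way ANOVA without replication.

```python
# One-facet (person x rater) G-study sketch with hypothetical scores.
from statistics import mean

scores = [  # rows: essays (persons); columns: raters
    [4, 5],
    [2, 3],
    [5, 5],
]
n_p, n_r = len(scores), len(scores[0])

grand = mean(x for row in scores for x in row)
p_means = [mean(row) for row in scores]
r_means = [mean(col) for col in zip(*scores)]

# Mean squares from the two-way ANOVA decomposition
ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
ss_pr = sum(
    (scores[i][j] - p_means[i] - r_means[j] + grand) ** 2
    for i in range(n_p) for j in range(n_r)
)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Expected-mean-squares solutions for the variance components
var_pr = ms_pr                            # person x rater interaction (+ error)
var_r = max((ms_r - ms_pr) / n_p, 0.0)    # rater main effect
var_p = max((ms_p - ms_pr) / n_r, 0.0)    # person (true-score) variance

# Generalizability (relative) and dependability (absolute) coefficients
# for a decision study that averages over n_r raters
g_coef = var_p / (var_p + var_pr / n_r)
phi_coef = var_p / (var_p + (var_r + var_pr) / n_r)
print(f"G = {g_coef:.3f}, Phi = {phi_coef:.3f}")
```

The relative coefficient ignores the rater main effect (raters are fixed across essays in rank-order decisions), while the dependability coefficient charges both rater leniency and interaction variance against the score, which is why it is never larger than the relative coefficient.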