Automated essay scoring (AES) is a broadly used application of machine learning, with a long history of real-world use that impacts high-stakes decision-making for students. However, defensibility arguments in this space have typically been rooted in hand-crafted features and psychometrics research, which are a poor fit for recent advances in AI research and more formative classroom use of the technology. This paper proposes a framework for evaluating automated essay scoring models trained with more modern algorithms, used in a classroom setting; that framework is then applied to evaluate an existing product, Turnitin Revision Assistant.
In this article, we describe a deployed educational technology application: the Criterion Online Essay Evaluation Service, a web-based system that provides automated scoring and evaluation of student essays. Criterion has two complementary applications: (1) Critique Writing Analysis Tools, a suite of programs that detect errors in grammar, usage, and mechanics, that identify discourse elements in the essay, and that recognize potentially undesirable elements of style, and (2) e-rater version 2.0, an automated essay scoring system. Critique and e-rater provide students with feedback that is specific to their writing in order to help them improve their writing skills, and both are intended to be used under the instruction of a classroom teacher. All of these capabilities outperform baseline algorithms, and some of the tools agree with human judges in their evaluations as often as two judges agree with each other.
Roscoe, Rod D. (Arizona State University) | Crossley, Scott A. (Georgia State University) | Snow, Erica L. (Arizona State University) | Varner, Laura K. (Arizona State University) | McNamara, Danielle S. (Arizona State University)
Automated essay scoring tools are often criticized on the basis of construct validity. Specifically, it has been argued that computational scoring algorithms may be unaligned to higher-level indicators of quality writing, such as writers’ demonstrated knowledge and understanding of the essay topics. In this paper, we consider how and whether the scoring algorithms within an intelligent writing tutor correlate with measures of writing proficiency and students’ general knowledge, reading comprehension, and vocabulary skill. Results indicate that the computational algorithms, although less attuned to knowledge and comprehension factors than human raters, were marginally related to such variables. Implications for improving automated scoring and intelligent tutoring of writing are briefly discussed.
In this paper, we present a new comparative study on automatic essay scoring (AES). Current state-of-the-art natural language processing (NLP) neural network architectures are used in this work to achieve above human-level accuracy on the publicly available Kaggle AES dataset. We compare two powerful language models, BERT and XLNet, and describe all the layers and network architectures in these models. We elucidate the network architectures of BERT and XLNet using clear notation and diagrams and explain the advantages of transformer architectures over traditional recurrent neural network architectures. Linear algebra notation is used to clarify the functions of transformers and attention mechanisms. We compare the results with more traditional methods, such as bag of words (BOW) and long short-term memory (LSTM) networks.
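As an illustration of the attention mechanism mentioned above, the following is a minimal sketch of standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written in plain Python. It is not taken from the paper: the function names `softmax` and `scaled_dot_product_attention` and the toy inputs are assumptions made for this example, and real BERT or XLNet layers add learned projections, multiple heads, and masking on top of this core operation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    queries: list of query vectors, each of length d_k
    keys:    list of key vectors, each of length d_k
    values:  list of value vectors, one per key
    (Hypothetical helper for illustration; no learned weights or masking.)
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When every key is equally similar to the query, the attention weights are uniform and the output is simply the mean of the value vectors, which is a convenient sanity check for a sketch like this.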
We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences. We develop a neural model of local coherence that can effectively learn connectedness features between sentences, and propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens the validity of neural essay scoring models.
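To make the notion of "grammatical but incoherent" input concrete, here is a minimal sketch of one way such an essay could be produced: permuting the order of its sentences. The abstract does not specify how the adversarial inputs were built, so this construction is an assumption for illustration only, as are the helper names `split_sentences` and `shuffle_sentences` and the naive punctuation-based sentence splitter.

```python
import random
import re


def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def shuffle_sentences(text, seed=0):
    """Return the essay with its sentences in a different, random order.

    Each sentence stays grammatical, but the links between sentences are
    broken. (Hypothetical construction, assumed for this example.)
    """
    sentences = split_sentences(text)
    # Nothing to reorder if there are fewer than two distinct sentences.
    if len(set(sentences)) < 2:
        return " ".join(sentences)
    rng = random.Random(seed)
    shuffled = sentences[:]
    # Reshuffle until the order actually differs from the original.
    while shuffled == sentences:
        rng.shuffle(shuffled)
    return " ".join(shuffled)
```

A scorer that relies mostly on word-level or sentence-level features would be expected to give the shuffled essay a score close to the original, which is the weakness a local coherence model is meant to address.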