AITopics | Assessment Methods

Collaborating Authors

Assessment Methods

Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

arXiv.org Artificial IntelligenceDec-1-2025

Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower and higher scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.11772

Genre:

Instructional Material > Course Syllabus & Notes (0.66)
Research Report > New Finding (0.48)

Industry:

Education > Curriculum > Subject-Specific Education (0.94)
Education > Assessment & Standards > Assessment Methods (0.71)
Education > Educational Setting > Online (0.68)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

Rüdian, Sylvio, Elsir, Yassin, Kretschmer, Marvin, Cayrou, Sabine, Pinkwart, Niels

arXiv.org Artificial IntelligenceAug-18-2025

Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.11364

Country: Europe > Germany (0.29)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Curriculum > Subject-Specific Education (0.87)
Education > Educational Setting > Online (0.69)
Education > Assessment & Standards > Assessment Methods (0.69)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring

Cohn, Clayton, S, Ashwin T, Mohammed, Naveeduddin, Biswas, Gautam

arXiv.org Artificial IntelligenceAug-15-2025

Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2504.02323

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)
Education > Curriculum > Subject-Specific Education (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System via Small Uncrewed Aerial Systems at Hurricanes Debby and Helene

Manzini, Thomas, Perali, Priyankari, Murphy, Robin R., Merrick, David

arXiv.org Artificial IntelligenceJun-23-2025

-- This paper details four principal challenges encountered with machine learning (ML) damage assessment using small uncrewed aerial systems (sUAS) at Hurricanes Debby and Helene that prevented, degraded, or delayed the delivery of data products during operations and suggests three research directions for future real-world deployments. The presence of these challenges is not surprising given that a review of the literature considering both datasets and proposed ML models suggests this is the first sUAS-based ML system for disaster damage assessment actually deployed as a part of real-world operations. The sUAS-based ML system was applied by the State of Florida to Hurricanes Helene (2 orthomosaics, 3.0 gigapixels collected over 2 sorties by a Wintra WingtraOne sUAS) and Debby (1 orthomosaic, 0.59 gigapixels collected via 1 sortie by a Wintra WingtraOne sUAS) in Florida. The same model was applied to crewed aerial imagery of inland flood damage resulting from post-tropical remnants of Hurricane Debby in Pennsylvania (436 orthophotos, 136.5 gigapixels), providing further insights into the advantages and limitations of sUAS for disaster response. The four challenges (variation in spatial resolution of input imagery, spatial misalignment between imagery and geospatial data, wireless connectivity, and data product format) lead to three recommendations that specify research needed to improve ML model capabilities to accommodate the wide variation of potential spatial resolutions used in practice, handle spatial misalignment, and minimize the dependency on wireless connectivity. These recommendations are expected to improve the effective operational use of sUAS and sUAS-based ML damage assessment systems for disaster response.

artificial intelligence, imagery, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2506.1589

Country:

North America > United States > Pennsylvania (0.35)
North America > United States > Florida (0.34)
North America > United States > Texas > Brazos County > College Station (0.14)

Genre: Research Report (0.84)

Industry:

Aerospace & Defense (1.00)
Government > Regional Government > North America Government > United States Government (0.66)
Education > Assessment & Standards > Assessment Methods (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools

Adorni, Giorgia

arXiv.org Artificial IntelligenceMar-27-2025

The rapid digitalisation of contemporary society has profoundly impacted various facets of our lives, including healthcare, communication, business, and education. The ability to engage with new technologies and solve problems has become crucial, making CT skills, such as pattern recognition, decomposition, and algorithm design, essential competencies. In response, Switzerland is conducting research and initiatives to integrate CT into its educational system. This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on AT, the ability to design algorithms. To achieve this, we first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context. This framework clarifies how activity characteristics influence CT development and how to assess these competencies. Additionally, we developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) and manual expert assessment, and the other based on digital artefacts (virtual) and automatic assessment. To provide a more comprehensive evaluation of students' competencies, we developed an IAS based on BNs with noisy gates, which offers real-time probabilistic assessment for each skill rather than a single overall score. The results indicate that the proposed instrument can measure AT competencies across different age groups and educational contexts in Switzerland, demonstrating its applicability for large-scale use. AT competencies exhibit a progressive development, with no overall gender differences, though variations are observed at the school level, significantly influenced by the artefact-based environment and its context, underscoring the importance of creating accessible and adaptable assessment tools.

artificial intelligence, development and implementation figure 7, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2503.22756

Country:

Europe > Ireland (0.14)
North America > United States > California > San Francisco County > San Francisco (0.13)
Europe > Austria > Vienna (0.13)
(46 more...)

Genre:

Workflow (1.00)
Summary/Review (1.00)
Research Report > New Finding (1.00)
(3 more...)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Government > Regional Government (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (1.00)
(7 more...)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Human Computer Interaction (1.00)
(10 more...)

Add feedback

R.U.Psycho? Robust Unified Psychometric Testing of Language Models

Schelb, Julian, Borin, Orr, Garcia, David, Spitz, Andreas

arXiv.org Artificial IntelligenceMar-13-2025

Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at https://github.com/julianschelb/rupsycho.

answer option, experiment, questionnaire, (15 more...)

arXiv.org Artificial Intelligence

2503.10229

Country:

North America > United States > Florida > Miami-Dade County > Miami (0.04)
North America > United States > Alaska (0.04)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.93)

Industry:

Transportation > Air (1.00)
Education > Assessment & Standards > Assessment Methods (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)

Add feedback

eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback

Liu, Zhexiong, Litman, Diane, Wang, Elaine, Li, Tianwen, Gobat, Mason, Matsumura, Lindsay Clare, Correnti, Richard

arXiv.org Artificial IntelligenceDec-31-2024

The ability to revise essays in response to feedback is important for students' writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.00715

Country:

North America > United States > Pennsylvania (0.34)
North America > United States > Minnesota (0.28)
North America > United States > Louisiana (0.24)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Curriculum > Subject-Specific Education (1.00)
Education > Assessment & Standards > Assessment Methods (0.84)
Education > Educational Setting > K-12 Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

An Automated Explainable Educational Assessment System Built on LLMs

Li, Jiazheng, Bobrov, Artem, West, David, Aloisi, Cesare, He, Yulan

arXiv.org Artificial IntelligenceDec-17-2024

In this demo, we present AERA Chat, an automated and explainable educational assessment system designed for interactive and visual evaluations of student responses. This system leverages large language models (LLMs) to generate automated marking and rationale explanations, addressing the challenge of limited explainability in automated educational assessment and the high costs associated with annotation. Our system allows users to input questions and student answers, providing educators and researchers with insights into assessment accuracy and the quality of LLM-assessed rationales. Additionally, it offers advanced visualization and robust evaluation tools, enhancing the usability for educational assessment and facilitating efficient rationale verification. Our demo video can be found at https://youtu.be/qUSjz-sxlBc.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2412.13381

Genre: Research Report (0.40)

Industry:

Education > Assessment & Standards > Student Performance (0.97)
Education > Assessment & Standards > Assessment Methods (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

NLP Cluster Analysis of Common Core State Standards and NAEP Item Specifications

Camilli, Gregory, Suter, Larry

arXiv.org Artificial IntelligenceDec-13-2024

Camilli (2024) proposed a methodology using natural language processing (NLP) to map the relationship of a set of content standards to item specifications. This study provided evidence that NLP can be used to improve the mapping process. As part of this investigation, the nominal classifications of standards and items specifications were used to examine construct equivalence. In the current paper, we determine the strength of empirical support for the semantic distinctiveness of these classifications, which are known as "domains" for Common Core standards, and "strands" for National Assessment of Educational Progress (NAEP) item specifications. This is accomplished by separate k-means clustering for standards and specifications of their corresponding embedding vectors. We then briefly illustrate an application of these findings.

machine learning, natural language, specification, (14 more...)

arXiv.org Artificial Intelligence

2412.04482

Country:

North America > United States > District of Columbia > Washington (0.05)
North America > United States > New Jersey > Bergen County > Mahwah (0.04)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.34)

Industry: Education > Assessment & Standards > Assessment Methods (0.72)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

Wallmark, Joakim, Josefsson, Maria, Wiberg, Marie

arXiv.org Machine LearningOct-2-2024

Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study.

irt model, mmc model, test taker, (14 more...)

arXiv.org Machine Learning

2410.0148

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Assessment & Standards > Assessment Methods (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback