Formative Assessment


A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents

Cohn, Clayton, Rayala, Surya, Srivastava, Namrata, Fonteles, Joyce Horn, Jain, Shruti, Luo, Xinying, Mereddy, Divya, Mohammed, Naveeduddin, Biswas, Gautam

arXiv.org Artificial Intelligence

Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
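The abstract does not spell out how assessment evidence maps to scaffolding moves, but the idea can be made concrete. Below is a minimal sketch, assuming a hypothetical rubric-score evidence model and three SCT-inspired support levels; all names (ScaffoldLevel, Evidence, choose_scaffold) are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class ScaffoldLevel(Enum):
    MODELING = "model the strategy step by step"
    COACHING = "hint at the next step"
    FADING = "prompt for self-explanation only"

@dataclass
class Evidence:
    """ECD-style evidence: a rubric score in [0, 1] and recent attempt count."""
    rubric_score: float
    attempts: int

def choose_scaffold(ev: Evidence) -> ScaffoldLevel:
    # Low mastery after repeated attempts: model the strategy directly
    # (SCT: learning by observing a competent model).
    if ev.rubric_score < 0.4 and ev.attempts >= 2:
        return ScaffoldLevel.MODELING
    # Partial mastery: coach with hints that keep the learner in charge.
    if ev.rubric_score < 0.8:
        return ScaffoldLevel.COACHING
    # High mastery: fade support to build self-efficacy.
    return ScaffoldLevel.FADING

print(choose_scaffold(Evidence(rubric_score=0.3, attempts=3)).value)
```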


AI-driven formative assessment and adaptive learning in data-science education: Evaluating an LLM-powered virtual teaching assistant

Anaroua, Fadjimata I, Li, Qing, Tang, Yan, Liu, Hong P.

arXiv.org Artificial Intelligence

This paper presents VITA (Virtual Teaching Assistants), an adaptive distributed learning (ADL) platform that embeds a large language model (LLM)-powered chatbot (BotCaptain) to provide dialogic support, interoperable analytics, and integrity-aware assessment for workforce preparation in data science. The platform couples context-aware conversational tutoring with formative-assessment patterns designed to promote reflective reasoning. The paper describes an end-to-end data pipeline that transforms chat logs into Experience API (xAPI) statements, instructor dashboards that surface outliers for just-in-time intervention, and an adaptive pathway engine that routes learners among progression, reinforcement, and remediation content. The paper also benchmarks VITA conceptually against emerging tutoring architectures, including retrieval-augmented generation (RAG)-based assistants and Learning Tools Interoperability (LTI)-integrated hubs, highlighting trade-offs among content grounding, interoperability, and deployment complexity. Contributions include a reusable architecture for interoperable conversational analytics, a catalog of patterns for integrity-preserving formative assessment, and a practical blueprint for integrating adaptive pathways into data-science courses. The paper concludes with implementation lessons and a roadmap (RAG integration, hallucination mitigation, and LTI 1.3 / OpenID Connect) to guide multi-course evaluations and broader adoption. In light of growing demand and scalability constraints in traditional instruction, the approach illustrates how conversational AI can support engagement, timely feedback, and personalized learning at scale. Future work will refine the platform's adaptive intelligence and examine applicability across varied educational settings.
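To make the chat-log-to-xAPI step concrete, here is a minimal sketch of mapping one tutoring turn to an xAPI statement. The activity URL, mbox domain, and field choices are placeholder assumptions, not VITA's actual schema; only the verb URI is a standard ADL vocabulary entry.

```python
import json
from datetime import datetime, timezone

def chat_turn_to_xapi(user_id: str, question_id: str, answer: str, correct: bool) -> dict:
    """Map one tutoring-chat turn to an xAPI statement (illustrative field choices)."""
    return {
        "actor": {"mbox": f"mailto:{user_id}@example.edu", "objectType": "Agent"},
        "verb": {
            "id": "http://adlnet.gov/expapi/verbs/answered",  # standard ADL verb
            "display": {"en-US": "answered"},
        },
        "object": {
            "id": f"https://vita.example.org/questions/{question_id}",  # placeholder IRI
            "objectType": "Activity",
        },
        "result": {"response": answer, "success": correct},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(chat_turn_to_xapi("s123", "q7", "The mean is 4.2", True), indent=2))
```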


CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring

Cohn, Clayton, S, Ashwin T, Mohammed, Naveeduddin, Biswas, Gautam

arXiv.org Artificial Intelligence

Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.
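A rough sketch of what a CoTAL-style scoring prompt might look like, with a rubric, worked chain-of-thought rationales, and the new response appended last. The rubric and examples here are invented for illustration; the paper's actual prompt format may differ.

```python
# Few-shot examples pairing a response, a chain-of-thought rationale, and a score.
FEW_SHOT = [
    {
        "response": "Erosion moves sediment downhill because of gravity and water.",
        "rationale": "Names an agent (water) and a mechanism (gravity moving sediment): meets both criteria.",
        "score": 2,
    },
    {
        "response": "Rocks just break.",
        "rationale": "Vague reference to weathering, no agent or mechanism: meets neither criterion.",
        "score": 0,
    },
]

def build_prompt(rubric: str, student_response: str) -> str:
    """Assemble rubric, worked examples, and the new response into one scoring prompt."""
    parts = [
        f"Rubric:\n{rubric}\n",
        "Score each response and explain your reasoning step by step.\n",
    ]
    for ex in FEW_SHOT:
        parts.append(f"Response: {ex['response']}\nReasoning: {ex['rationale']}\nScore: {ex['score']}\n")
    parts.append(f"Response: {student_response}\nReasoning:")
    return "\n".join(parts)

print(build_prompt("1 pt: names an agent of erosion. 1 pt: explains the mechanism.",
                   "Wind carries small particles away over time."))
```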


The Impact of AI on Educational Assessment: A Framework for Constructive Alignment

Stokkink, Patrick

arXiv.org Artificial Intelligence

The influence of Artificial Intelligence (AI), and specifically Large Language Models (LLMs), on education is continuously increasing. These models are frequently used by students, raising the question of whether current forms of assessment are still a valid way to evaluate student performance and comprehension. The theoretical framework developed in this paper is grounded in Constructive Alignment (CA) theory and Bloom's taxonomy for defining learning objectives. We argue that AI influences learning objectives at different Bloom levels in different ways, and assessment has to be adapted accordingly. Furthermore, in line with Bloom's vision, formative and summative assessment should be aligned on whether the use of AI is permitted or not. Although lecturers tend to agree that education and assessment need to be adapted to the presence of AI, a strong bias exists in the extent to which lecturers want to allow AI in assessment. This bias is driven by a lecturer's familiarity with AI, specifically whether they use it themselves. To avoid this bias, we propose structured guidelines at the university or faculty level to foster alignment among staff. Beyond that, we argue that teaching staff should be trained on the capabilities and limitations of AI tools so that they are better able to adapt their assessment methods.


Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

Henkel, Owen, Horne-Robinson, Hannah, Dyshel, Maria, Ch, Nabil, Moreau-Pernet, Baptiste, Abood, Ralph

arXiv.org Artificial Intelligence

This paper introduces AMMORE, a new dataset of 53,000 open-response mathematics question-answer pairs from Rori, a learning platform used by students in several African countries, and conducts two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. The AMMORE dataset enables a variety of analyses and provides an important resource for researching student math acquisition in understudied, real-world educational contexts. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach, chain-of-thought prompting, accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that relatively modest improvements in model accuracy at the individual question level can lead to significant changes in the estimation of student mastery. Where the rule-based classifier currently used to grade student answers misclassified the mastery status of 6.9% of students across their completed lessons, the LLM chain-of-thought approach reduced this misclassification rate to 2.6% of students. Taken together, these findings suggest that LLMs could be a valuable tool for grading open-response questions in K-12 mathematics education, potentially encouraging wider adoption of open-ended questions in formative assessment.
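The link from item-level grades to mastery estimates runs through the standard BKT update, which makes it easy to see why a single misgrade matters. A self-contained sketch with illustrative parameter values (the paper's fitted parameters are not given in the abstract):

```python
def bkt_update(p_know: float, correct: bool,
               p_slip: float = 0.1, p_guess: float = 0.2, p_transit: float = 0.15) -> float:
    """One BKT step: Bayesian posterior given the graded answer, then a learning transition."""
    if correct:
        posterior = (p_know * (1 - p_slip)) / (p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        posterior = (p_know * p_slip) / (p_know * p_slip + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_transit

# A single misgrade flips the trajectory: compare the same prior under
# a "correct" vs. an "incorrect" observation.
p = 0.5
print(bkt_update(p, correct=True))   # ~0.85: mastery estimate rises sharply
print(bkt_update(p, correct=False))  # ~0.24: mastery estimate falls sharply
```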


A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Cohn, Clayton, Hutchins, Nicole, Le, Tuan, Biswas, Gautam

arXiv.org Artificial Intelligence

This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.
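The abstract does not specify how responses are chosen for human review; one plausible reading of the active-learning step is a diversity heuristic that routes responses unlike the current few-shot pool to the teacher. A toy sketch under that assumption, using a crude bag-of-words similarity:

```python
from collections import Counter
import math

def bag(text: str) -> Counter:
    """Crude bag-of-words representation."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_for_labeling(unlabeled: list[str], few_shot: list[str], k: int = 2) -> list[str]:
    """Return the k responses least covered by the current few-shot pool."""
    def coverage(response: str) -> float:
        return max(cosine(bag(response), bag(ex)) for ex in few_shot)
    return sorted(unlabeled, key=coverage)[:k]

pool = ["Wind erosion moves sand.", "idk", "Glaciers scrape valleys into a U shape."]
print(select_for_labeling(pool, ["Water erosion carries sediment downstream."]))
```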


Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

Henkel, Owen, Horne-Robinson, Hannah, Hills, Libby, Roberts, Bill, McGrane, Joshua

arXiv.org Artificial Intelligence

This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2 and wav2vec 2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate (WER) of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
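Because the models are used out of the box, the core of such a pipeline is short. Here is a sketch using the open-source openai-whisper and jiwer packages; the file name, reference passage, and the WCPM approximation are illustrative, not the authors' exact scoring procedure.

```python
import whisper  # pip install openai-whisper
import jiwer    # pip install jiwer

# Transcribe one student's reading with the Whisper V2 checkpoint, no fine-tuning.
model = whisper.load_model("large-v2")
result = model.transcribe("student_reading.wav")  # placeholder file name
hypothesis = result["text"]

# Word Error Rate against the passage the student was asked to read.
reference = "The quick brown fox jumps over the lazy dog."  # placeholder passage
error_rate = jiwer.wer(reference, hypothesis)
print("WER:", error_rate)

# Rough words-correct-per-minute (WCPM), a standard ORF score. Treating
# (1 - WER) as the fraction of passage words read correctly is only an
# approximation, since WER also counts insertions.
duration_min = result["segments"][-1]["end"] / 60 if result["segments"] else 1.0
wcpm = max(0.0, len(reference.split()) * (1 - error_rate)) / duration_min
print("WCPM:", wcpm)
```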


Telangana: AI to aid Govt schools in formative assessments

#artificialintelligence

Hyderabad: Select Government school complexes (a cluster of high school, middle, and primary schools) in the State may soon get to use artificial intelligence-based tools that automate some time- and resource-consuming processes such as formative assessments, marking attendance, and logging mid-day meal data, among others. They will even be put to use to teach English and, later, other languages too. These artificial intelligence-based tools will be implemented in select school complexes in Moinabad as a pilot project through the Prof Raj Centre at IIIT-H, which is working to create artificial intelligence technologies and technology solutions for the grassroots. A team from IIIT-H had initial meetings with select school complex headmasters, resource persons, and officials of the Education Department to understand problems at the grassroots level. "We plan to meet the concerned people once again and make specific plans for the technology interventions possible. We want to keep the technologies ready for the coming academic year. These will be short-term projects of three to six months that aim to address the issues at the earliest," said Ramesh Loganathan, Co-Innovation Professor at IIIT-H.


Future of Testing in Education: Artificial Intelligence - Center for American Progress

#artificialintelligence

This series is about the future of testing in America's schools. Part one of the series presents a theory of action for the role assessments should play in schools. Part two, this issue brief, reviews advancements in technology, with a focus on artificial intelligence that can powerfully drive learning in real time. And the third part looks at assessment designs that can improve large-scale standardized tests. Despite the often-negative discussion about testing in schools, assessments are a necessary and useful tool in the teaching and learning process.


AI and Formative Assessment

#artificialintelligence

In my last post, I talked about effective formative assessments and their powerful impact on student learning. In this post, let's explore why AI is well-suited for formative assessment. I think individualized feedback is the most powerful advantage of AI for assessment. As a teacher, I can only be in one place at a time looking in one direction at a time. That means I have two choices for feedback: I can take some time to assess how each student is doing and then address general learning barriers as a class, or I can assess and give feedback to students one at a time.