Scarlatos, Alexander
Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
Scarlatos, Alexander, Liu, Naiming, Lee, Jaewook, Baraniuk, Richard, Lan, Andrew
Recent advances in generative artificial intelligence (AI), including large language models (LLMs), have opened new possibilities in education, in particular for scaling up personalization. One form of personalization that generative AI powers is interactive learning via tutoring dialogues between AI-powered tutors and students. These interactions have the potential to tailor instruction to each student's needs and progress, while offering personalized feedback, all in real time and in a scalable way. Given the widespread success of human tutors in improving student outcomes [29], many recent works have developed LLM-based tutors, showing promise across various educational domains [15, 25, 30, 32, 33, 39, 42, 50]. Many LLM-based tutors are even deployed in practice, such as Khan Academy's Khanmigo [21] and Carnegie Learning's LiveHint [4]. Several preliminary studies have shown that interacting with LLMs can increase student learning [52], although others have shown that students can develop an over-reliance on LLMs that negatively impacts their learning [23]. Many prior works have focused on improving LLMs' ability to follow effective tutoring principles, adapting them to the tutoring task for which they are not pre-trained. One approach, explored in [46], analyzes the decision-making process underlying human tutor utterances, showing that integrating expert decisions enhances LLM-based tutoring. Another study [28] examines tutor moves in interactions with an LLM-powered simulated student agent, demonstrating that move annotation data contributes to better tutoring performance.
ReaLJam: Real-Time Human-AI Music Jamming with Reinforcement Learning-Tuned Transformers
Scarlatos, Alexander, Wu, Yusong, Simon, Ian, Roberts, Adam, Cooijmans, Tim, Jaques, Natasha, Tarakajian, Cassie, Huang, Cheng-Zhi Anna
Recent advances in generative artificial intelligence (AI) have created models capable of high-quality musical content generation. However, little consideration has been given to how these models can be used in real-time, cooperative jamming applications, which require low latency, the ability to communicate planned actions, and the ability to adapt to user input in real time. To support these needs, we introduce ReaLJam, an interface and protocol for live musical jamming sessions between a human and a Transformer-based AI agent trained with reinforcement learning. We enable real-time interaction using the concept of anticipation, where the agent continually predicts how the performance will unfold and visually conveys its plan to the user. We conduct a user study in which experienced musicians jam in real time with the agent through ReaLJam. Our results demonstrate that ReaLJam enables enjoyable and musically interesting sessions, and we uncover important takeaways for future work.
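The abstract above does not spell out the anticipation mechanism in code, but it can be illustrated with a minimal sketch: the agent keeps a short horizon of planned future events, shows that plan to the user, and re-plans whenever new user input arrives. The JamAgent class and its predict method below are hypothetical stand-ins, not the ReaLJam API.

import time
from collections import deque

class JamAgent:
    """Hypothetical stand-in for an RL-tuned Transformer jamming agent."""
    def predict(self, history, horizon=4):
        # Placeholder: return `horizon` planned chord symbols.
        return ["C", "F", "G", "C"][:horizon]

def jam_loop(agent, get_user_event, render_plan, num_beats=64, beat_seconds=0.5):
    """Anticipation loop: maintain a plan of upcoming agent actions,
    display it to the user, and re-plan whenever new user input arrives."""
    history = []
    plan = deque(agent.predict(history))
    for _ in range(num_beats):
        render_plan(list(plan))                   # visually convey the agent's plan
        event = get_user_event()                  # non-blocking read of user input
        if event is not None:
            history.append(event)
            plan = deque(agent.predict(history))  # re-plan to adapt to the user
        if plan:
            history.append(plan.popleft())        # commit the next planned action
        time.sleep(beat_seconds)                  # stand-in for audio clock sync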
Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
Caraeni, Adriana, Scarlatos, Alexander, Lan, Andrew
Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
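As a rough illustration of the setup (not the authors' exact prompts), a rubric-conditioned grading call to a multi-modal model could look like the sketch below; the helper function, prompt wording, and score format are assumptions.

import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def grade_handwritten_solution(image_path, question, rubric):
    """Ask GPT-4o to grade a scanned handwritten response against a rubric."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Question: {question}\nRubric: {rubric}\n"
                         "Grade the handwritten solution in the image. "
                         "Reply with a score from 0 to 10 and a one-sentence justification."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content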
Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs
Scarlatos, Alexander, Baker, Ryan S., Lan, Andrew
Recent advances in large language models (LLMs) have led to the development of artificial intelligence (AI)-powered tutoring chatbots, showing promise in providing broad access to high-quality personalized education. Existing works have studied how to make LLMs follow tutoring principles, but have not studied broader uses of LLMs for supporting tutoring. Until now, tracing student knowledge and analyzing misconceptions have been difficult and time-consuming to implement for open-ended dialogue tutoring. In this work, we investigate whether LLMs can support this task: we first use LLM prompting methods to identify the knowledge components/skills involved in each dialogue turn, i.e., a tutor utterance posing a task or a student utterance that responds to it. We also evaluate whether the student responds correctly to the tutor and verify the LLM's accuracy using human expert annotations. We then apply a range of knowledge tracing (KT) methods to the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues. We perform extensive qualitative analyses to highlight the challenges in dialogue KT and outline multiple avenues for future work.
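To make the pipeline concrete, the sketch below applies standard Bayesian Knowledge Tracing (one of the classic KT baselines, not the proposed LLMKT model) to LLM-labeled turns, where each turn carries a skill label and a correctness flag; the parameter values and the "fractions" skill are illustrative.

from collections import defaultdict

def bkt_update(p_known, correct, p_learn=0.2, p_slip=0.1, p_guess=0.2):
    """One Bayesian Knowledge Tracing step for a single skill."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    return posterior + (1 - posterior) * p_learn  # learning transition

def trace_dialogue(turns, p_init=0.3):
    """turns: list of (skill, correct) pairs produced by the LLM annotator."""
    knowledge = defaultdict(lambda: p_init)
    for skill, correct in turns:
        knowledge[skill] = bkt_update(knowledge[skill], correct)
    return dict(knowledge)

# Example: two labeled turns on a single skill
print(trace_dialogue([("fractions", True), ("fractions", False)]))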
DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions
Fernandez, Nigel, Scarlatos, Alexander, Woodhead, Simon, Lan, Andrew
High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
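At a high level, treating the error behind a distractor as a text-valued latent variable corresponds to a standard evidence lower bound; the objective below is a generic sketch of that kind of variational objective, not the exact DiVERT training loss.

\log p_\theta(d \mid x) \;\ge\; \mathbb{E}_{e \sim q_\phi(e \mid x, d)}\!\left[\log p_\theta(d \mid x, e)\right] - \mathrm{KL}\!\left(q_\phi(e \mid x, d) \,\|\, p_\theta(e \mid x)\right)

Here x is the question stem, d a distractor, and e a textual error description; q_\phi proposes an error that explains the distractor, and p_\theta reconstructs the distractor from that error.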
Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank
Scarlatos, Alexander, Feng, Wanyong, Smith, Digory, Woodhead, Simon, Lan, Andrew
Multiple-choice questions (MCQs) are commonly used across all levels of math education since they can be deployed and graded at a large scale. A critical component of MCQs is the distractors, i.e., incorrect answers crafted to reflect student errors or misconceptions. Automatically generating them in math MCQs, e.g., with large language models, has been challenging. In this work, we propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank, training a ranking model to predict how likely distractors are to be selected by real students. Experimental results on a real-world dataset and human evaluation with math teachers show that our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.
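Schematically, overgenerate-and-rank works as in the sketch below: sample many candidate distractors from an LLM, score each with a trained ranker that estimates how likely real students are to select it, and keep the top few. The generate_candidates and ranker_score callables are hypothetical placeholders for the LLM sampling call and the trained ranking model.

def overgenerate_and_rank(question, answer, generate_candidates, ranker_score,
                          num_samples=20, top_k=3):
    """Overgenerate candidate distractors, then keep the top-k by ranker score."""
    candidates = set()
    for _ in range(num_samples):
        for d in generate_candidates(question, answer):  # one LLM sampling call
            if d != answer:                              # distractors must be incorrect
                candidates.add(d)
    # Higher score = more likely to be selected by real students
    ranked = sorted(candidates, key=lambda d: ranker_score(question, d), reverse=True)
    return ranked[:top_k]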
Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models
Feng, Wanyong, Lee, Jaewook, McNichols, Hunter, Scarlatos, Alexander, Smith, Digory, Woodhead, Simon, Ornelas, Nancy Otero, Lan, Andrew
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable format for assessment and practice. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.
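The simplest family of approaches studied, in-context learning, amounts to assembling a few-shot prompt from existing MCQs and asking the LLM to continue the pattern; the sketch below shows one plausible prompt layout, with field names and instruction wording that are assumptions rather than the paper's exact prompt.

def build_fewshot_prompt(examples, target_question, target_answer, num_distractors=3):
    """examples: list of dicts with 'question', 'answer', and 'distractors' keys."""
    parts = ["Generate incorrect options that reflect common student errors.\n"]
    for ex in examples:  # a handful of human-authored MCQs as demonstrations
        parts.append(
            f"Question: {ex['question']}\n"
            f"Correct answer: {ex['answer']}\n"
            f"Distractors: {'; '.join(ex['distractors'])}\n"
        )
    parts.append(
        f"Question: {target_question}\n"
        f"Correct answer: {target_answer}\n"
        f"Distractors ({num_distractors}):"
    )
    return "\n".join(parts)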
SyllabusQA: A Course Logistics Question Answering Dataset
Fernandez, Nigel, Scarlatos, Alexander, Lan, Andrew
In educational applications, artificial intelligence (AI) approaches have shown significant promise in improving learning outcomes (Aleven et al., 2016; VanLehn, 2011), by automatically providing feedback to students or engaging in tutoring dialogues with them. The key idea is to use AI to create an on-demand virtual teaching assistant to interact with many students simultaneously; see, e.g., Khanmigo from Khan Academy (Academy, 2022). These approaches can scale up the effort of expert human teachers and tutors, and relieve them from doing repetitive tasks so that they can focus on providing [...]

Moreover, text similarity metrics may not be suitable for some open-ended natural language generation tasks (Amidei et al., 2018). As an example, the answer "The final exam will be on Dec 15" has high surface-level textual similarity with the reference answer, "The final exam is on Dec 14", but contains a critical factual error that may lead to significant negative consequences for students. Meanwhile, human instructors and teaching assistants often answer student questions in a concise way, without giving any unnecessary information. Therefore, it is important for LLM-based approaches to generate answers that are both concise and precise.
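The point about surface similarity can be made concrete with a simple token-overlap score: the two answers above disagree on the one token that matters, yet still score highly. The sketch below uses a plain unigram F1 (in the spirit of SQuAD-style answer matching), not any metric specific to SyllabusQA.

from collections import Counter

def token_f1(prediction, reference):
    """Unigram-overlap F1 between two short answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The final exam will be on Dec 15",
               "The final exam is on Dec 14"))  # ~0.67 despite the wrong date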
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning
Scarlatos, Alexander, Smith, Digory, Woodhead, Simon, Lan, Andrew
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback also has to be pedagogically valid, reflecting effective tutoring strategies such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both problems of automatically generating and evaluating feedback while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated by Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems using case studies, and outline several areas for future work.
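For reference, the core of direct preference optimization is a simple loss over preferred/rejected feedback pairs; the sketch below computes the standard DPO loss from sequence log-probabilities under the policy and a frozen reference model, and is a generic illustration rather than the authors' full training setup.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from summed sequence log-probs of chosen/rejected feedback."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy example with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -13.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-13.5, -13.2]))
print(loss.item())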
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning
McNichols, Hunter, Feng, Wanyong, Lee, Jaewook, Scarlatos, Alexander, Smith, Digory, Woodhead, Simon, Lan, Andrew
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment. An important aspect of MCQs is the distractors, i.e., incorrect options that are designed to target specific misconceptions or insufficient knowledge among students. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers, which has limited scalability. In this work, we explore the task of automated distractor and corresponding feedback message generation in math MCQs using large language models. We establish a formulation of these two tasks and propose a simple, in-context learning-based solution. Moreover, we propose generative AI-based metrics for evaluating the quality of the feedback messages. We conduct extensive experiments on these tasks using a real-world MCQ dataset. Our findings suggest that there is substantial room for improvement in automated distractor and feedback generation; based on these findings, we outline several directions for future work.
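A generative-AI-based evaluation metric of the kind described can be sketched as an LLM-as-judge call that scores a feedback message along a few dimensions; the criteria, model choice, and output format below are illustrative assumptions, not the paper's exact metric.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge_feedback(question, distractor, feedback,
                   criteria=("identifies the misconception",
                             "is mathematically correct",
                             "is encouraging")):
    """Ask an LLM to rate a feedback message on each criterion (1-5)."""
    prompt = (
        f"Question: {question}\nIncorrect option: {distractor}\nFeedback: {feedback}\n"
        "Rate the feedback on each criterion from 1 (poor) to 5 (excellent), "
        "one line per criterion:\n" + "\n".join(f"- {c}" for c in criteria)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content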