Goto

Collaborating Authors

 Lin, Jionghao


VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output

arXiv.org Artificial Intelligence

The rapid evolution of large language models (LLMs) has transformed human-computer interaction (HCI), but the interaction with LLMs is currently mainly focused on text-based interactions, while other multi-model approaches remain under-explored. This paper introduces VTutor, an open-source Software Development Kit (SDK) that combines generative AI with advanced animation technologies to create engaging, adaptable, and realistic APAs for human-AI multi-media interactions. VTutor leverages LLMs for real-time personalized feedback, advanced lip synchronization for natural speech alignment, and WebGL rendering for seamless web integration. Supporting various 2D and 3D character models, VTutor enables researchers and developers to design emotionally resonant, contextually adaptive learning agents. This toolkit enhances learner engagement, feedback receptivity, and human-AI interaction while promoting trustworthy AI principles in education. VTutor sets a new standard for next-generation APAs, offering an accessible, scalable solution for fostering meaningful and immersive human-AI interaction experiences. The VTutor project is open-sourced and welcomes community-driven contributions and showcases.


Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

arXiv.org Artificial Intelligence

Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.


Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

arXiv.org Artificial Intelligence

Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning, such as in intelligent tutoring systems (ITSs). Learning performance data tend to be highly sparse (80\%\(\sim\)90\% missing observations) in most real-world applications due to adaptive item selection. This data sparsity presents challenges to using learner models to effectively predict future performance explore new hypotheses about learning. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data. First, learning performance is represented as a three-dimensional tensor of learners' questions, answers, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation on knowledge tracing tasks that predict missing performance values based on real observations. Third, a module for generating patterns of learning is used. This study contrasts two forms of generative Artificial Intelligence (AI), including Generative Adversarial Networks (GANs) and Generate Pre-Trained Transformers (GPT) to generate data associated with different clusters of learner data. We tested this approach on an adult literacy dataset from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that: (1) tensor factorization improved the performance in tracing and predicting knowledge mastery compared with other knowledge tracing techniques without data augmentation, showing higher relative fidelity for this imputation method, and (2) the GAN-based simulation showed greater overall stability and less statistical bias based on a divergence evaluation with varying simulation sample sizes compared to GPT.


Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

arXiv.org Artificial Intelligence

Caregivers (i.e., parents and members of a child's caring community) are underappreciated stakeholders in learning analytics. Although caregiver involvement can enhance student academic outcomes, many obstacles hinder involvement, most notably knowledge gaps with respect to modern school curricula. An emerging topic of interest in learning analytics is hybrid tutoring, which includes instructional and motivational support. Caregivers assert similar roles in homework, yet it is unknown how learning analytics can support them. Our past work with caregivers suggested that conversational support is a promising method of providing caregivers with the guidance needed to effectively support student learning. We developed a system that provides instructional support to caregivers through conversational recommendations generated by a Large Language Model (LLM). Addressing known instructional limitations of LLMs, we use instructional intelligence from tutoring systems while conducting prompt engineering experiments with the open-source Llama 3 LLM. This LLM generated message recommendations for caregivers supporting their child's math practice via chat. Few-shot prompting and combining real-time problem-solving context from tutoring systems with examples of tutoring practices yielded desirable message recommendations. These recommendations were evaluated with ten middle school caregivers, who valued recommendations facilitating content-level support and student metacognition through self-explanation. We contribute insights into how tutoring systems can best be merged with LLMs to support hybrid tutoring settings through conversational assistance, facilitating effective caregiver involvement in tutoring systems.


Do Tutors Learn from Equity Training and Can Generative AI Assess It?

arXiv.org Artificial Intelligence

Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, with this present work assessing its use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains with increases in tutors' self-reported confidence in their knowledge in responding to middle school students experiencing possible inequities from pretest to posttest. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.


Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

arXiv.org Artificial Intelligence

The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease in grading, open response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates MCQs effectiveness relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized control design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective, and more efficient, than open response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.


A Systematic Review on Prompt Engineering in Large Language Models for K-12 STEM Education

arXiv.org Artificial Intelligence

The term "K-12" stands for "Kindergarten through 12th grade" and represents the full range of primary and secondary education. Within this system, a strong emphasis has been placed on STEM (Science, Technology, Engineering, and Mathematics) education as a means to prepare students for a technology-driven future. STEM education at the K-12 level focuses on building foundational knowledge in scientific inquiry, technological literacy, engineering principles, and mathematical reasoning [10, 29, 64]. The K-12 STEM education emphasizes interdisciplinary learning, where students apply concepts from multiple domains to solve real-world challenges, such as integrating mathematics with science to tackle engineering problems [29]. The importance of K-12 STEM education lies in its ability to prepare students for a rapidly evolving, technology-driven world by fostering critical thinking, creativity, and problem-solving skills from an early age [10]. Students who engage in well-structured STEM curricula are more likely to pursue further education and careers in high-demand fields like information technology and engineering which are essential for technological innovation. Additionally, K-12 STEM education equips students with competencies such as analytical thinking, which prepare them for a wide range of career paths while enabling them to tackle complex problems [64]. Recognizing the importance of STEM education at the K-12 level, it is essential to deliver K-12 STEM education at scale to ensure equitable access to individual students.


RAMO: Retrieval-Augmented Generation for Enhancing MOOCs Recommendations

arXiv.org Artificial Intelligence

Massive Open Online Courses (MOOCs) have significantly enhanced educational accessibility by offering a wide variety of courses and breaking down traditional barriers related to geography, finance, and time. However, students often face difficulties navigating the vast selection of courses, especially when exploring new fields of study. Driven by this challenge, researchers have been exploring course recommender systems to offer tailored guidance that aligns with individual learning preferences and career aspirations. These systems face particular challenges in effectively addressing the ``cold start'' problem for new users. Recent advancements in recommender systems suggest integrating large language models (LLMs) into the recommendation process to enhance personalized recommendations and address the ``cold start'' problem. Motivated by these advancements, our study introduces RAMO (Retrieval-Augmented Generation for MOOCs), a system specifically designed to overcome the ``cold start'' challenges of traditional course recommender systems. The RAMO system leverages the capabilities of LLMs, along with Retrieval-Augmented Generation (RAG)-facilitated contextual understanding, to provide course recommendations through a conversational interface, aiming to enhance the e-learning experience.


SPL: A Socratic Playground for Learning Powered by Large Language Model

arXiv.org Artificial Intelligence

Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) such as OpenAI's GPT-4, offer promising solutions by providing human-like and context-aware responses based on extensive pre-trained knowledge. Motivated by the effectiveness of LLMs in various educational tasks (e.g., content creation and summarization, problem-solving, and automated feedback provision), our study introduces the Socratic Playground for Learning (SPL), a dialogue-based ITS powered by the GPT-4 model, which employs the Socratic teaching method to foster critical thinking among learners. Through extensive prompt engineering, SPL can generate specific learning scenarios and facilitates efficient multi-turn tutoring dialogues. The SPL system aims to enhance personalized and adaptive learning experiences tailored to individual needs, specifically focusing on improving critical thinking skills. Our pilot experimental results from essay writing tasks demonstrate SPL has the potential to improve tutoring interactions and further enhance dialogue-based ITS functionalities. Our study, exemplified by SPL, demonstrates how LLMs enhance dialogue-based ITSs and expand the accessibility and efficacy of educational technologies.


How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses

arXiv.org Artificial Intelligence

One-on-one tutoring is widely acknowledged as an effective instructional method, conditioned on qualified tutors. However, the high demand for qualified tutors remains a challenge, often necessitating the training of novice tutors (i.e., trainees) to ensure effective tutoring. Research suggests that providing timely explanatory feedback can facilitate the training process for trainees. However, it presents challenges due to the time-consuming nature of assessing trainee performance by human experts. Inspired by the recent advancements of large language models (LLMs), our study employed the GPT-4 model to build an explanatory feedback system. This system identifies trainees' responses in binary form (i.e., correct/incorrect) and automatically provides template-based feedback with responses appropriately rephrased by the GPT-4 model. We conducted our study on 410 responses from trainees across three training lessons: Giving Effective Praise, Reacting to Errors, and Determining What Students Know. Our findings indicate that: 1) using a few-shot approach, the GPT-4 model effectively identifies correct/incorrect trainees' responses from three training lessons with an average F1 score of 0.84 and an AUC score of 0.85; and 2) using the few-shot approach, the GPT-4 model adeptly rephrases incorrect trainees' responses into desired responses, achieving performance comparable to that of human experts.