worksheet
FACET: Teacher-Centred LLM-Based Multi-Agent Systems - Towards Personalized Educational Worksheets
Gonnermann-Müller, Jana, Haase, Jennifer, Fackeldey, Konstantin, Pokutta, Sebastian
The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI-driven personalization tools have emerged, most remain performance-focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher-facing, large language model (LLM)-based multi-agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through (a) automated agent-based assessment of output quality and (b) exploratory feedback from K-12 in-service teachers. Results from ten internal evaluations showed high stability and close alignment between generated materials and learner profiles, and teacher feedback particularly praised the structure and suitability of the tasks. The findings demonstrate the potential of multi-agent LLM architectures to provide scalable, context-aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real-world classroom trials.
- North America > United States (0.14)
- Europe > Germany > Berlin (0.04)
- Europe > Germany > Brandenburg > Potsdam (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Education > Educational Setting (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (0.94)
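The three-agent division of labour in the FACET abstract can be caricatured in plain Python. This is a minimal sketch under stated assumptions: the profile fields, the 0.5 motivation threshold, and the difficulty-ladder heuristic are all hypothetical stand-ins for the paper's LLM agents, not the actual FACET implementation.

```python
from dataclasses import dataclass

@dataclass
class LearnerProfile:
    """Simulated learner agent state (hypothetical fields)."""
    name: str
    proficiency: float   # 0.0 (novice) .. 1.0 (expert)
    motivation: float    # 0.0 (low) .. 1.0 (high intrinsic motivation)

def teacher_agent(profile: LearnerProfile, base_tasks: list) -> list:
    """Adapt a worksheet to one learner: proficiency controls how far up
    the difficulty ladder we go; low motivation adds scaffolding hints."""
    k = 1 + round(profile.proficiency * (len(base_tasks) - 1))
    tasks = base_tasks[:k]
    if profile.motivation < 0.5:
        tasks = [f"{t} (hint provided)" for t in tasks]
    return tasks

def evaluator_agent(profile: LearnerProfile, tasks: list) -> bool:
    """Automated QA: worksheet must be non-empty and scaffolded when needed."""
    if not tasks:
        return False
    if profile.motivation < 0.5 and not all("hint" in t for t in tasks):
        return False
    return True
```

In the paper all three roles are played by prompted LLMs; the point of the sketch is only the control flow: learner profiles parameterize the teacher agent, and the evaluator gates the output before it reaches the classroom.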
The AI Takeover of Education Is Just Getting Started
Rising seniors are the last class of students who remember high school before ChatGPT. But only just barely: OpenAI's chatbot was released months into their freshman year. Ever since then, writing essays hasn't required, well, writing. By the time these students graduate next spring, they will have completed almost four full years of AI high school. Gone already are the days when using AI to write an essay meant copying and pasting its response verbatim.
- North America > United States > Texas > Harris County > Houston (0.05)
- North America > United States > New York (0.05)
- North America > United States > Iowa (0.05)
- North America > United States > California > Sacramento County > Sacramento (0.05)
Exploring Moral Exercises for Human Oversight of AI systems: Insights from Three Pilot Studies
Crafa, Silvia, Scantamburlo, Teresa
This paper elaborates on the concept of moral exercises as a means to help AI actors cultivate virtues that enable effective human oversight of AI systems. We explore the conceptual framework and significance of moral exercises, situating them within the contexts of philosophical discourse, ancient practices, and contemporary AI ethics scholarship. We outline the core pillars of the moral exercises methodology -- eliciting an engaged personal disposition, fostering relational understanding, and cultivating technomoral wisdom -- and emphasize their relevance to key activities and competencies essential for human oversight of AI systems. Our argument is supported by findings from three pilot studies involving a company, a multidisciplinary team of AI researchers, and higher education students. These studies allow us to explore both the potential and the limitations of moral exercises. Based on the collected data, we offer insights into how moral exercises can foster a responsible AI culture within organizations, and suggest directions for future research.
- North America > United States > Virginia (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (5 more...)
I was in gifted classes as a kid in the 90s... here's why I think it was a secret CIA program
Americans who were part of 'gifted and talented' education programs in the 1980s and 1990s believe they were part of a secret government intelligence program. The Gifted And Talented Education program (GATE) provides students with advanced curriculum and activities to foster creativity and critical thinking skills. But many former students believe they were actually part of a secret CIA initiative to test the supernatural abilities of children with above average intelligence. One woman, who claimed to be part of the program in the 1990s, shared a workbook she purportedly used during class, showing she was cracking codes and learning Russian. 'The stuff I found in there -- I'm like, what were you training us for?' she said. Some former GATE students argued that the program was tied to the CIA's Gateway Program that was developed in the 1980s to explore the limitations of human consciousness using sound, meditation and other techniques.
Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations
Wang, Rose E., Wirawarn, Pawan, Lam, Kenny, Khattab, Omar, Demszky, Dorottya
Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language models (LLMs) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
- (4 more...)
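The joint segment-and-link task that POSR formalizes can be illustrated with a deliberately simple stdlib sketch. Both components are stand-in assumptions, not the paper's methods: an explicit topic-shift phrase replaces TextTiling-style segmentation, and raw word overlap replaces a learned retriever such as ColBERT.

```python
def segment_and_link(turns, problems):
    """Jointly segment a lesson transcript and link each segment to the
    reference problem with the highest word overlap."""
    # Segmentation: start a new segment at an explicit transition phrase.
    segments, current = [], []
    for turn in turns:
        if turn.lower().startswith("let's move on") and current:
            segments.append(current)
            current = []
        current.append(turn)
    if current:
        segments.append(current)

    # Retrieval: link each segment to the best-overlapping problem.
    def link(segment):
        seg_words = set(" ".join(segment).lower().split())
        overlaps = [len(seg_words & set(p.lower().split())) for p in problems]
        return overlaps.index(max(overlaps))

    return [(seg, link(seg)) for seg in segments]
```

The paper's finding that joint modeling beats independent pipelines makes intuitive sense even in this toy form: a segmentation error (a missed transition) immediately corrupts the retrieval input, so the two sub-tasks are best scored and optimized together.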
Developing a Tutoring Dialog Dataset to Optimize LLMs for Educational Use
Fateen, Menna, Mine, Tsunenori
Recent advances in large language models (LLMs) have shown promise for scalable educational applications, but their use in dialog-based tutoring systems remains challenging due to the need for effective pedagogical strategies and the high costs associated with expert-curated datasets. Our study explores the use of smaller, more affordable LLMs for one-on-one tutoring in the context of solving reading comprehension problems. We developed a synthetic tutoring dialog dataset, evaluated by human teachers, and fine-tuned a smaller LLM using this dataset. Furthermore, we conducted an interactive experiment comparing the performance of the fine-tuned model with a larger model in real-world tutoring scenarios. Our results show that the fine-tuned model performs on par with the larger model but at a lower cost, demonstrating a viable, cost-effective approach for implementing LLM-based tutoring systems in educational settings.
- Asia > Middle East > Jordan (0.05)
- Oceania > Australia (0.04)
- Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)
- (5 more...)
- Education > Educational Setting (1.00)
- Education > Educational Technology > Educational Software (0.54)
- Education > Curriculum > Subject-Specific Education (0.46)
- Education > Assessment & Standards > Student Performance (0.35)
LLM-Based Open-Domain Integrated Task and Knowledge Assistants with Programmable Policies
Joshi, Harshit, Liu, Shicheng, Chen, James, Weigle, Robert, Lam, Monica S.
Programming LLM-based knowledge and task assistants that faithfully conform to developer-provided policies is challenging. These agents must retrieve and provide consistent, accurate, and relevant information to address users' queries and needs. Yet such agents generate unfounded responses ("hallucinate"). Traditional dialogue trees can only handle a limited number of conversation flows, making them inherently brittle. To this end, we present KITA - a programmable framework for creating task-oriented conversational agents that are designed to handle complex user interactions. Unlike LLMs, KITA provides reliable grounded responses, with controllable agent policies through its expressive specification, KITA Worksheet. In contrast to dialog trees, it is resilient to diverse user queries, helpful with knowledge sources, and offers ease of programming policies through its declarative paradigm. Through a real-user study involving 62 participants, we show that KITA beats the GPT-4 with function calling baseline by 26.1, 22.5, and 52.4 points on execution accuracy, dialogue act accuracy, and goal completion rate, respectively. We also release 22 real-user conversations with KITA manually corrected to ensure accuracy.
- North America > United States > California > San Francisco County > San Francisco (0.15)
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Asia > Singapore (0.04)
- (6 more...)
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Information Technology (1.00)
- Health & Medicine (1.00)
- Education (1.00)
- (2 more...)
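The core idea behind a declarative worksheet, an agent that must fill validated fields before it may act, can be sketched as a toy slot-filling model. This is an assumption-laden simplification: the real KITA Worksheet specification is far more expressive, and the class and method names here are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class SlotField:
    """One declaratively specified field: a name plus a validity predicate."""
    name: str
    validate: callable
    value: object = None

@dataclass
class Worksheet:
    """The agent's policy is implicit: gather every valid field value,
    and only act (respond, call an API) once the worksheet is complete.
    This keeps responses grounded in validated state, not free generation."""
    fields: list

    def missing(self):
        return [f.name for f in self.fields if f.value is None]

    def fill(self, name, value):
        for f in self.fields:
            if f.name == name:
                if not f.validate(value):
                    return False  # reject invalid input, keep asking
                f.value = value
                return True
        return False  # unknown field

    def complete(self):
        return not self.missing()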
An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
Khanuja, Simran, Ramamoorthy, Sathyanarayanan, Song, Yueqi, Neubig, Graham
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data are released here: https://github.com/simran-khanuja/image-transcreation.
- Asia > Japan (0.06)
- Africa > Nigeria (0.05)
- South America > Brazil (0.05)
- (9 more...)
- Media (0.67)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.46)
- Education (0.46)
ECBD: Evidence-Centered Benchmark Design for NLP
Liu, Yu Lu, Blodgett, Su Lin, Cheung, Jackie Chi Kit, Liao, Q. Vera, Olteanu, Alexandra, Xiao, Ziang
Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- (8 more...)
Evaluating and Optimizing Educational Content with Large Language Model Judgments
He-Yueya, Joy, Goodman, Noah D., Brunskill, Emma
Creating effective educational materials generally requires expensive and time-consuming studies of student learning outcomes. To overcome this barrier, one idea is to build computational models of student learning and use them to optimize instructional materials. However, it is difficult to model the cognitive processes of learning dynamics. We propose an alternative approach that uses Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. Specifically, we use GPT-3.5 to evaluate the overall effect of instructional materials on different student groups and find that it can replicate well-established educational findings such as the Expertise Reversal Effect and the Variability Effect. This demonstrates the potential of LMs as reliable evaluators of educational content. Building on this insight, we introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. We apply this approach to create math word problem worksheets aimed at maximizing student learning gains. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences. We conclude by discussing potential divergences between human and LM opinions and the resulting pitfalls of automating instructional design.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Instructional Material > Course Syllabus & Notes (0.66)
- Education > Assessment & Standards (1.00)
- Education > Educational Setting > K-12 Education (0.68)
- Education > Educational Technology > Educational Software > Computer Based Training (0.68)
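The generate-and-judge loop in the He-Yueya et al. abstract can be mimicked with a stdlib sketch in which random sampling from a problem bank stands in for the generator LM and a distance-to-target difficulty score stands in for the GPT-3.5 judge. Both substitutions are illustrative assumptions; the paper optimizes real instructional text against real LM judgments.

```python
import random

def judge(worksheet, target_difficulty):
    """Stand-in for the LM judge: reward worksheets whose mean difficulty
    is close to the level deemed best for the target student group."""
    mean = sum(worksheet) / len(worksheet)
    return -abs(mean - target_difficulty)

def optimize_worksheet(problem_bank, size, target_difficulty,
                       rounds=200, seed=0):
    """Stand-in for the generator LM: propose candidate worksheets and
    keep the one the judge scores highest (reward-guided search)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidate = rng.sample(problem_bank, size)
        score = judge(candidate, target_difficulty)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

The structure matches the paper's framing: one model's judgment serves as the reward function for another model's generations, which is also why the authors' closing caution matters, since any systematic divergence between the judge and human teachers gets optimized into the worksheets.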