Instructional Material
Bridging LMS and generative AI: dynamic course content integration (DCCI) for enhancing student satisfaction and engagement via the ask ME assistant
Mzwri, Kovan, Turcsányi-Szabo, Márta
Integration of Large Language Models (LLMs) with Learning Management Systems (LMSs) can enhance task automation and accessibility in education. However, hallucination where LLMs generate inaccurate or misleading information remains a challenge. This study introduces the Dynamic Course Content Integration (DCCI) mechanism, which dynamically retrieves course content from Canvas LMS and structures it within an LLM's context window via prompt engineering, enabling the LLM-powered assistant, Ask ME, to deliver context-aware, curriculum-aligned responses while mitigating hallucinations. A mixed-methods pilot study grounded in Self-Determination Theory (autonomy, competence) and the Technology Acceptance Model (perceived usefulness, ease of use) evaluated DCCI's effectiveness with 120 first-year programming students at Eötvös Loránd University. The course focused on foundational programming patterns in C#, including writing program specifications. We analyzed 14,746 logged interactions and a post-course survey completed by 101 students. User satisfaction was measured via a 5-point Likert scale (turn-level ratings), while the survey assessed usability, engagement, and ethical concerns. Results indicated high satisfaction (mean 4.65/5) and strong recognition of Ask ME's ability to provide timely, contextually relevant answers to administrative and course-related queries. 78.06% agreed that Ask ME's Canvas integration reduced platform switching, improving usability, engagement, comprehension, and topic exploration. Many students reported reduced hesitation to ask questions and increased motivation for self-directed learning, though concerns about over-reliance on AI and reduced student-teacher interaction emerged. This study demonstrates that DCCI enhances LLM reliability, student satisfaction, and engagement in AI-driven educational automation, while highlighting the importance of balancing
Robot Crash Course: Learning Soft and Stylized Falling
Strauch, Pascal, Müller, David, Christen, Sammy, Serifi, Agon, Grandia, Ruben, Knoop, Espen, Bächer, Moritz
Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot's end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification
Lafage, Adrien, Laurent, Olivier, Gabetni, Firas, Franchi, Gianni
Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at https://github.com/ENSTA-U2IS-AI/Torch-Uncertainty
Owlgorithm: Supporting Self-Regulated Learning in Competitive Programming through LLM-Driven Reflection
Nieto-Cardenas, Juliana, Kramer, Erin Joy, Kurto, Peter, Dickey, Ethan, Bejarano, Andres
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system-provided reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing their potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests
Soni, Rahul, Suleiman, Basem, Singh, Sonit
Abstract--This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students in making data-drive and industry-informed course decisions, in turn, improving graduate outcomes for the university sector .
A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation
Al-Kharusi, Mohammed Hilal, Hayat, Khizar, Ruqeishi, Khalil Bader Al, Lone, Haroon Rashid
The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today's digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Breen, Benjamin, Del Tredici, Marco, McCarran, Jacob, Mijares, Javier Aspuru, Yin, Weichen Winston, Sulimany, Kfir, Taylor, Jacob M., Koppens, Frank H. L., Englund, Dirk
We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam
Kortemeyer, Gerd, Caspar, Alexander, Horica, Daria
We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments such as slightly higher weight and protected time, a few rubric-visible substeps, stronger spatial anchoring should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
Report from Workshop on Dialogue alongside Artificial Intelligence
McKenna, Thomas J, Rasmussen, Ingvill, Ludvigsen, Sten, Arvatz, Avivit, Asterhan, Christa, Chen, Gaowei, Cohen, Julie, Flammia, Michele, Han, Dongkeun, Hayward, Emma, Hill, Heather, Kolikant, Yifat, Lehndorf, Helen, Li, Kexin, Matsumura, Lindsay Clare, Tjønn, Henrik, Wang, Pengjin, Wegerif, Rupert
Educational dialogue -- the collaborative exchange of ideas through talk -- is widely recognized as a catalyst for deeper learning and critical thinking in and across contexts. At the same time, artificial intelligence (AI) has rapidly emerged as a powerful force in education, with the potential to address major challenges, personalize learning, and innovate teaching practices. However, these advances come with significant risks: rapid AI development can undermine human agency, exacerbate inequities, and outpace our capacity to guide its use with sound policy. Human learning presupposes cognitive efforts and social interaction (dialogues). In response to this evolving landscape, an international workshop titled "Educational Dialogue: Moving Thinking Forward" convened 19 leading researchers from 11 countries in Cambridge (September 1-3, 2025) to examine the intersection of AI and educational dialogue. This AI-focused strand of the workshop centered on three critical questions: (1) When is AI truly useful in education, and when might it merely replace human effort at the expense of learning? (2) Under what conditions can AI use lead to better dialogic teaching and learning? (3) Does the AI-human partnership risk outpacing and displacing human educational work, and what are the implications? These questions framed two days of presentations and structured dialogue among participants.
AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search
Bi, Shuzhen, Song, Chang, Song, Siyu, Lv, Jinze, Chen, Jian, Wang, Xinyun, Zhou, Aimin, Hao, Hao
Four-Period Detailed Design Period 1: Topic Selection and Initial Exploration Period 2: Principle Analysis and Model Design Period 3: Model Construction and Refinement Period 4: "Historical Technology Expo" with presentations [Includes detailed student reflection prompts, extension activities, and troubleshooting guidance...] Base Model: Generic Outline Interdisciplinary Lesson Plan Design Learning Objectives: Help students understand how physics influences historical progress... Cultivate ability to analyze social factors behind technological development... Class Schedule: Four periods covering physics review, historical technologies, case study, and modern applications. Assessment: Class participation, group reports, reflection journals [Subsequent periods contain only high-level bullet points without actionable details...] 12 Qualitative Analysis This comparison reveals dramatic capability differences for complex generation tasks. The Base Model produces only a generic outline with vague bullet points--entirely insufficient for classroom use. Both AutoSynth and Expert-Designed models generate outstanding, comprehensive lesson plans with detailed objectives, granular activities, and sophisticated assessment schemes. The subtle differences reflect their optimization processes: AutoSynth emphasizes systematic difficulty coverage (likely from iterative refinement), while Expert-Designed showcases deep assessment design expertise. Both represent quantum leaps over baseline, validating that specialized workflows-- automated or manual--are essential for professional-grade content. This supports our quantitative findings (Table 1): while Au-toSynth achieves lower human preference (51 percent vs 96 percent), it produces genuinely high-quality outputs far superior to baseline capabilities.