Instructional Material
Trustworthiness of Legal Considerations for the Use of LLMs in Education
Alaswad, Sara, Kalganova, Tatiana, Awad, Wasan
As Artificial Intelligence (AI), particularly Large Language Models (LLMs), becomes increasingly embedded in education systems worldwide, ensuring their ethical, legal, and contextually appropriate deployment has become a critical policy concern. This paper offers a comparative analysis of AI-related regulatory and ethical frameworks across key global regions, including the European Union, United Kingdom, United States, China, and Gulf Cooperation Council (GCC) countries. It maps how core trustworthiness principles, such as transparency, fairness, accountability, data privacy, and human oversight are embedded in regional legislation and AI governance structures. Special emphasis is placed on the evolving landscape in the GCC, where countries are rapidly advancing national AI strategies and education-sector innovation. To support this development, the paper introduces a Compliance-Centered AI Governance Framework tailored to the GCC context. This includes a tiered typology and institutional checklist designed to help regulators, educators, and developers align AI adoption with both international norms and local values. By synthesizing global best practices with region-specific challenges, the paper contributes practical guidance for building legally sound, ethically grounded, and culturally sensitive AI systems in education. These insights are intended to inform future regulatory harmonization and promote responsible AI integration across diverse educational environments.
Beyond Least Squares: Robust Regression Transformer (R2T)
Gutierrez, Roman, Tang, Tony Kai, Gutierrez, Isabel
Robust regression techniques rely on least-squares optimization, which works well for Gaussian noise but fails in the presence of asymmetric structured noise. We propose a hybrid neural-symbolic architecture where a transformer encoder processes numerical sequences, a compression NN predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. Using synthetic data, the training objective is to recover the original sequence after adding asymmetric structured noise, effectively learning a symbolic fit guided by neural parameter estimation. Our model achieves a median regression MSE of 6e-6 to 3.5e-5 on synthetic wearable data, which is a 10-300 times improvement when compared with ordinary least squares fit and robust regression techniques such as Huber loss or SoftL1.
VQA support to Arabic Language Learning Educational Tool
Delassi, Khaled Bachir, Zeggane, Lakhdar, Cherroun, Hadda, Haouhat, Abdelhamid, Bouzouad, Kaoutar
--W e address the problem of scarcity of educational Arabic Language Learning tools that advocates modern pedagogical models such active learning which ensures language proficiency . In fact, we investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency level. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity . Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes, and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image description from which Large Language Model generate assignments based on customized Arabic language Learning quizzes thanks to prompting. The effectiveness of the tool is evaluated through a manual annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show a suitable accuracy rates, validating the tool's potential to bridge the gap in Arabic language education and highlighting the tool's promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences. I. Introduction Language learning has never been more important than it is today. Since the onset of globalization, language learning has become essential in facilitating communication across cultures and opening up numerous educational and professional opportunities [6]. To excel in any language, it is crucial to develop proficiency in all four core skills: listening, writing, reading, and speaking.
Co-designing Zoomorphic Robot Concepts for Animal Welfare Education
Voysey, Isobel, Baillie, Lynne, Williams, Joanne, Herrmann, Michael
Animal welfare education could greatly benefit from customized robots to help children learn about animals and their behavior, and thereby promote positive, safe child-animal interactions. To this end, we ran Participatory Design workshops with animal welfare educators and children to identify key requirements for zoomorphic robots from their perspectives. Our findings encompass a zoomorphic robot's appearance, behavior, and features, as well as concepts for a narrative surrounding the robot. Through comparing and contrasting the two groups, we find the importance of: negative reactions to undesirable behavior from children; using the facial features and tail to provide cues signaling an animal's internal state; and a natural, furry appearance and texture. We also contribute some novel activities for Participatory Design with children, including branching storyboards inspired by thematic apperception tests and interactive narratives, and reflect on some of the key design challenges of achieving consensus between the groups, despite much overlap in their design concepts.
Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education
Chamberland, Jean-Francois, Carlisle, Martin C., Jayaraman, Arul, Narayanan, Krishna R., Palsole, Sunay, Watson, Karan
Evaluating teaching effectiveness at scale remains a persistent challenge for large universities, particularly within engineering programs that enroll tens of thousands of students. Traditional methods, such as manual review of student evaluations, are often impractical, leading to overlooked insights and inconsistent data use. This article presents a scalable, AI-supported framework for synthesizing qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments while upholding ethical safeguards. Visual analytics contextualize numeric scores through percentile-based comparisons, historical trends, and instructional load. The approach supports meaningful evaluation and aligns with best practices in qualitative analysis and educational assessment, incorporating student, peer, and self-reflective inputs without automating personnel decisions. We report on its successful deployment across a large college of engineering. Preliminary validation through comparisons with human reviewers, faculty feedback, and longitudinal analysis suggests that LLM-generated summaries can reliably support formative evaluation and professional development. This work demonstrates how AI systems, when designed with transparency and shared governance, can promote teaching excellence and continuous improvement at scale within academic institutions.
Mathematical Foundations of Geometric Deep Learning
Borde, Haitz Sรกez de Ocรกriz, Bronstein, Michael
Since the dawn of civilization, humans have tried to understand the nature of intelligence. With the advent of computers, there have been attempts to emulate human intelligence using computer algorithms - a field that was dubbed'Artificial Intelligence' or'AI' by the computer scientist John McCarthy in 1956 and has recently enjoyed an explosion of popularity. Many efforts in AI research have focused on the study and replication of what is considered the hallmark of human cognition, such as playing intelligent games, the faculty of language, visual perception, and creativity. While at the time of writing we have multiple successful takes at the above - computers nowadays play chess and Go better than any human, can translate English into Chinese without a dictionary, automatically drive a car in a crowded city, and generate poetry and art that wins artistic competitions - it is fair to say that we still do not have a full understanding of what human-like or'general' intelligence entails and how to replicate it.
FedCD: A Fairness-aware Federated Cognitive Diagnosis Framework
Yang, Shangshang, Han, Jialin, Yu, Xiaoshan, Wang, Ziwen, Jiang, Hao, Ma, Haiping, Zhang, Xingyi, Min, Geyong
Online intelligent education platforms have generated a vast amount of distributed student learning data. This influx of data presents opportunities for cognitive diagnosis (CD) to assess students' mastery of knowledge concepts while also raising significant data privacy and security challenges. To cope with this issue, federated learning (FL) becomes a promising solution by jointly training models across multiple local clients without sharing their original data. However, the data quality problem, caused by the ability differences and educational context differences between different groups/schools of students, further poses a challenge to the fairness of models. To address this challenge, this paper proposes a fairness-aware federated cognitive diagnosis framework (FedCD) to jointly train CD models built upon a novel parameter decoupling-based personalization strategy, preserving privacy of data and achieving precise and fair diagnosis of students on each client. As an FL paradigm, FedCD trains a local CD model for the students in each client based on its local student learning data, and each client uploads its partial model parameters to the central server for parameter aggregation according to the devised innovative personalization strategy. The main idea of this strategy is to decouple model parameters into two parts: the first is used as locally personalized parameters, containing diagnostic function-related model parameters, to diagnose each client's students fairly; the second is the globally shared parameters across clients and the server, containing exercise embedding parameters, which are updated via fairness-aware aggregation, to alleviate inter-school unfairness. Experiments on three real-world datasets demonstrate the effectiveness of the proposed FedCD framework and the personalization strategy compared to five FL approaches under three CD models.
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices
Cruz, Luรญs, Fernandes, Joรฃo Paulo, Kirkeby, Maja H., Martรญnez-Fernรกndez, Silverio, Sallou, June, Anwar, Hina, Roque, Enrique Barba, Bogner, Justus, Castaรฑo, Joel, Castor, Fernando, Chasmawala, Aadil, Cunha, Simรฃo, Feitosa, Daniel, Gonzรกlez, Alexandra, Jedlitschka, Andreas, Lago, Patricia, Muccini, Henry, Oprescu, Ana, Rani, Pooja, Saraiva, Joรฃo, Sarro, Federica, Selvan, Raghavendra, Vaidhyanathan, Karthik, Verdecchia, Roberto, Yamshchikov, Ivan P.
The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Europรฉen de Calcul Atomique et Molรฉculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.
Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education
Gaggioli, Andrea, Casaburi, Giuseppe, Ercolani, Leonardo, Collova', Francesco, Torre, Pietro, Davide, Fabrizio
This study investigates the reliability and validity of five advanced Large Language Models (LLMs)--Claude 3.5, DeepSeek v2, Gemini 2.5, GPT 4, and Mistral 24B--for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall's W < .30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate convergence for Coherence and Originality, but negligible concordance for Pertinence and Feasibility. Although limited in scope, these findings suggest that current LLMs may struggle to replicate human judgment in tasks requiring disciplinary insight and contextual sensitivity. Human oversight remains critical when evaluating open-ended academic work, particularly in interpretive domains.
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Das, Amitava, Jain, Vinija, Chadha, Aman
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks, with delta less than 0.2 and improved refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7