Instructional Material
A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models
Wang, Wenkai, Guo, Hongcan, Lv, Zheqi, Zhang, Shengyu
Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking , AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task's training state from the distribution of model generated multi-turn trajectories' performance. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients which is modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.
Probabilistic Emissivity Retrieval from Hyperspectral Data via Physics-Guided Variational Inference
Tempelman, Joshua R., Mitchell, Kevin, Wachtor, Adam J., Flynn, Eric B.
Recent research has proven neural networks to be a powerful tool for performing hyperspectral imaging (HSI) target identification. However, many deep learning frameworks deliver a single material class prediction and operate on a per-pixel basis; such approaches are limited in their interpretability and restricted to predicting materials that are accessible in available training libraries. In this work, we present an inverse modeling approach in the form of a physics-conditioned generative model.A probabilistic latent-variable model learns the underlying distribution of HSI radiance measurements and produces the conditional distribution of the emissivity spectrum. Moreover, estimates of the HSI scene's atmosphere and background are used as a physically relevant conditioning mechanism to contextualize a given radiance measurement during the encoding and decoding processes. Furthermore, we employ an in-the-loop augmentation scheme and physics-based loss criteria to avoid bias towards a predefined training material set and to encourage the model to learn physically consistent inverse mappings. Monte-Carlo sampling of the model's conditioned posterior delivers a sought emissivity distribution and allows for interpretable uncertainty quantification. Moreover, a distribution-based material matching scheme is presented to return a set of likely material matches for an inferred emissivity distribution. Hence, we present a strategy to incorporate contextual information about a given HSI scene, capture the possible variation of underlying material spectra, and provide interpretable probability measures of a candidate material accounting for given remotely-sensed radiance measurement.
Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Omura, Motoki, Ota, Kazuki, Osa, Takayuki, Mukuta, Yusuke, Harada, Tatsuya
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.
What's coming up at #IJCAI2025?
The IJCAI-25 logo and theme photo (cropped). The 34rd International Joint Conference on Artificial Intelligence (IJCAI-25) will be held in Montrรฉal, Canada from 16-22 August. The programme will feature keynote talks, tutorials, workshops, competitions, and oral and poster presentations. There will also be four special tracks, focussing on: AI for social good, AI and arts, human-centred AI, and AI enabling critical technologies. An exciting addition this year is the satellite event, to be held in Guangzhou, China, from 29-31 August.
Bio-Inspired Artificial Neural Networks based on Predictive Coding
Casnici, Davide, Frenkel, Charlotte, Dauwels, Justin
Backpropagation (BP) of errors is the backbone training algorithm for artificial neural networks (ANNs). It updates network weights through gradient descent to minimize a loss function representing the mismatch between predictions and desired outputs. BP uses the chain rule to propagate the loss gradient backward through the network hierarchy, allowing efficient weight updates. However, this process requires weight updates at every layer to rely on a global error signal generated at the network's output. In contrast, the Hebbian model of synaptic plasticity states that weight updates are local, depending only on the activity of pre- and post-synaptic neurons. This suggests biological brains likely do not implement BP directly. Recently, Predictive Coding (PC) has gained interest as a biologically plausible alternative that updates weights using only local information. Originating from 1950s work on signal compression, PC was later proposed as a model of the visual cortex and formalized under the free energy principle, linking it to Bayesian inference and dynamical systems. PC weight updates rely solely on local information and provide theoretical advantages such as automatic scaling of gradients based on uncertainty. This lecture notes column offers a novel, tutorial-style introduction to PC, focusing on its formulation, derivation, and connections to well-known optimization and signal processing algorithms such as BP and the Kalman Filter (KF). It aims to support existing literature by guiding readers from the mathematical foundations of PC to practical implementation, including Python examples using PyTorch.
Simulating Generative Social Agents via Theory-Informed Workflow Design
Yan, Yuwei, Piao, Jinghua, Lan, Xiaochong, Shao, Chenyang, Hui, Pan, Li, Yong
Recent advances in large language models have demonstrated strong reasoning and role-playing capabilities, opening new opportunities for agent-based social simulations. However, most existing agents' implementations are scenario-tailored, without a unified framework to guide the design. This lack of a general social agent limits their ability to generalize across different social contexts and to produce consistent, realistic behaviors. To address this challenge, we propose a theory-informed framework that provides a systematic design process for LLM-based social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. These modules jointly enable agents to reason about their goals, plan coherent actions, and adapt their behavior over time, leading to more flexible and contextually appropriate responses. Comprehensive experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions, achieving up to 75% lower deviation from real-world behavioral data across multiple fidelity metrics compared to classical generative baselines. Ablation studies further show that removing motivation, planning, or learning modules increases errors by 1.5 to 3.2 times, confirming their distinct and essential contributions to generating realistic and coherent social behaviors.
Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL
Su, Shibin, Liang, Guoqiang, Cheng, De, Zhang, Shizhou, Ran, Lingyan, Zhang, Yanning
--Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. However, OCIL faces two key challenges: maintaining model stability under strict memory constraints and ensuring adaptability to new tasks. Under stricter memory constraints, current replay-based methods are less effective. While ensemble methods improve adaptability (plasticity), they often struggle with stability. T o overcome these challenges, we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM)--a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. This fused model is then redistributed periodically to the students to stabilize learning and promote cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. This approach enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. Class-Incremental Learning is designed to integrate the knowledge of classes from a stream of data with an evolved distribution [1].
Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment
Zahid, Farzana, Sewwandi, Anjalika, Brandon, Lee, Kumar, Vimal, Sinha, Roopak
Due to perceptions of efficiency and significant productivity gains, various organisations, including in education, are adopting Large Language Models (LLMs) into their workflows. Educator-facing, learner-facing, and institution-facing LLMs, collectively, Educational Large Language Models (eLLMs), complement and enhance the effectiveness of teaching, learning, and academic operations. However, their integration into an educational setting raises significant cybersecurity concerns. A comprehensive landscape of contemporary attacks on LLMs and their impact on the educational environment is missing. This study presents a generalised taxonomy of fifty attacks on LLMs, which are categorized as attacks targeting either models or their infrastructure. The severity of these attacks is evaluated in the educational sector using the DREAD risk assessment framework. Our risk assessment indicates that token smuggling, adversarial prompts, direct injection, and multi-step jailbreak are critical attacks on eLLMs. The proposed taxonomy, its application in the educational environment, and our risk assessment will help academic and industrial practitioners to build resilient solutions that protect learners and institutions.
Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study
Isley, Calvin, Gilbert, Joshua, Kassos, Evangelos, Kocher, Michaela, Nie, Allen, Brunskill, Emma, Domingue, Ben, Hofman, Jake, Legewie, Joscha, Svoronos, Teddy, Tuminelli, Charlotte, Goel, Sharad
While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement on automatically generating questions using artificial intelligence, but also comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes -- covering computer science, mathematics, chemistry, and more -- in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
Designing a Feedback-Driven Decision Support System for Dynamic Student Intervention
Adeyemi, Timothy Oluwapelumi, AlOtaibi, Nadiah Fahad
Accurate prediction of student performance is essential for enabling timely academic interventions. However, most machine learning models used in educational settings are static and lack the ability to adapt when new data such as post-intervention outcomes become available. To address this limitation, we propose a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture that enables continuous model refinement. The system employs a LightGBM-based regressor with incremental retraining, allowing educators to input updated student performance data, which automatically triggers model updates. This adaptive mechanism enhances prediction accuracy by learning from real-world academic progress over time. The platform features a Flask-based web interface to support real-time interaction and integrates SHAP (SHapley Additive exPlanations) for model interpretability, ensuring transparency and trustworthiness in predictions. Experimental results demonstrate a 10.7% reduction in RMSE after retraining, with consistent upward adjustments in predicted scores for students who received interventions. By transforming static predictive models into self-improving systems, our approach advances educational analytics toward human-centered, data-driven, and responsive artificial intelligence. The framework is designed for seamless integration into Learning Management Systems (LMS) and institutional dashboards, facilitating practical deployment in real educational environments.