Instructional Material
R-WoM: Retrieval-augmented World Model For Computer-use Agents
Mei, Kai, Guo, Jiang, Chang, Shuaichen, Dong, Mingwen, Lee, Dongkyu, Niu, Xing, Jiang, Jiarong
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLM's tendency to hallucination and their reliance on static training knowledge, which could lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models - future state prediction and reward estimation - through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantage in longer-horizon simulations. World models have evolved from early symbolic planning systems to sophisticated neural architectures that learn latent representations of environment dynamics. Model-based reinforcement learning (MBRL) approaches, such as Dreamer v1-3 (Hafner et al., 2019; 2020; 2023) and MuZero (Schrittwieser et al., 2020), learn latent world models to "imagine" trajectories before selecting actions. More recently, Large Language Model (LLM)-based world models (Hao et al., 2023; Wang et al., 2024; Zhang et al., 2024) have emerged as a new paradigm, leveraging large-scale pre-training to reason about action consequences in realistic digital environments.
GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Wang, Ruida, Yao, Jiarui, Pan, Rui, Diao, Shizhe, Zhang, Tong
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves the training efficiency and enables stronger performance of proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
Artificial Intelligence for Optimal Learning: A Comparative Approach towards AI-Enhanced Learning Environments
In the rapidly evolving educational landscape, the integration of technology has shifted from an enhancement to a cornerstone of educational strategy worldwide. This transition is propelled by advancements in digital technology, especially the emergence of artificial intelligence as a crucial tool in learning environments. This research project critically evaluates the impact of three distinct educational settings: traditional educational methods without technological integration, those enhanced by non-AI technology, and those utilising AI-driven technologies. This comparison aims to assess how each environment influences educational outcomes, engagement, pedagogical methods, and equity in access to learning resources, and how each contributes uniquely to the learning experience. The ultimate goal of this research is to synthesise the strengths of each model to create a more holistic educational approach. By integrating the personal interaction and tested pedagogical techniques of traditional classrooms, the enhanced accessibility and collaborative tools offered by non-AI technology, and the personalised, adaptive learning strategies enabled by AI-driven technologies, education systems can develop richer, more effective learning environments. This hybrid approach aims to leverage the best elements of each setting, thereby enhancing educational outcomes, engagement, and inclusiveness, while also addressing the distinct challenges and limitations inherent in each model. The intention is to create an educational framework deeply attentive to the diverse needs of students, ensuring equitable access to high-quality education for all.
Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
Li, Shangzhe, Zhou, Dongruo, Zhang, Weitong
We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $ฮต$ and the policy variance $ฯ^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
Learning Diffusion Models with Flexible Representation Guidance
Wang, Chenyu, Zhou, Cai, Gupta, Sharut, Lin, Zongyu, Jegelka, Stefanie, Bates, Stephen, Jaakkola, Tommi
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.
A Community-driven vision for a new Knowledge Resource for AI
Chaudhri, Vinay K, Baru, Chaitan, Bennett, Brandon, Bhatt, Mehul, Cassel, Darion, Cohn, Anthony G, Dechter, Rina, Erdem, Esra, Ferrucci, Dave, Forbus, Ken, Gelfond, Gregory, Genesereth, Michael, Gordon, Andrew S., Grosof, Benjamin, Gupta, Gopal, Hendler, Jim, Israni, Sharat, Josephson, Tyler R., Kyllonen, Patrick, Lierler, Yuliya, Lifschitz, Vladimir, McFate, Clifton, McGinty, Hande K., Morgenstern, Leora, Oltramari, Alessandro, Paritosh, Praveen, Roth, Dan, Shepard, Blake, Shimzu, Cogan, Vrandeฤiฤ, Denny, Whiting, Mark, Witbrock, Michael
The Cyc project, started in 1984, created the first large-scale database of commonsense knowledge. The initiative continues to this day with its aim to provide a comprehensive ontology and knowledge base of commonsense knowledge to enable human-like reasoning for AI systems. In the concluding paragraph of his Communications of the Association of Computing Machinery (CACM) 1995 article A Large-Scale Investment in Knowledge Infrastructure [52], Cyc's founder Douglas B. Lenat wrote: Is Cyc necessary? How far would a user get with something simpler than Cyc but that lacks everyday commonsense knowledge? Nobody knows; the question will be settled empirically. Our guess is most of these applications will eventually tap the synergy in a suite of sources (including neural nets and decision theory), one of which will be Cyc. Although 30 years have passed since the above article was written, AI research community has not conclusively settled [10] the question "How far would a user get with something simpler than Cyc but that lacks everyday commonsense knowledge?" However, it is clear that significant strides have been made in addressing many of the tasks that were original Cyc use cases, including information retrieval, semi-automatically linking multiple heterogeneous external information sources, spelling and grammar correction, machine translation, natural language understanding and speech understanding.
A Clustering-Based Method for Automatic Educational Video Recommendation Using Deep Face-Features of Lecturers
Mendes, Paulo R. C., Vieira, Eduardo S., Guedes, รlan L. V., Busson, Antonio J. G., Colcher, Sรฉrgio
Discovering and accessing specific content within educational video bases is a challenging task, mainly because of the abundance of video content and its diversity. Recommender systems are often used to enhance the ability to find and select content. But, recommendation mechanisms, especially those based on textual information, exhibit some limitations, such as being error-prone to manually created keywords or due to imprecise speech recognition. This paper presents a method for generating educational video recommendation using deep face-features of lecturers without identifying them. More precisely, we use an unsupervised face clustering mechanism to create relations among the videos based on the lecturer's presence. Then, for a selected educational video taken as a reference, we recommend the ones where the presence of the same lecturers is detected. Moreover, we rank these recommended videos based on the amount of time the referenced lecturers were present. For this task, we achieved a mAP value of 99.165%.
Lecture Notes on Verifying Graph Neural Networks
In these lecture notes, we first recall the connection between graph neural networks, Weisfeiler-Lehman tests and logics such as first-order logic and graded modal logic. We then present a modal logic in which counting modalities appear in linear inequalities in order to solve verification tasks on graph neural networks. We describe an algorithm for the satisfiability problem of that logic. It is inspired from the tableau method of vanilla modal logic, extended with reasoning in quantifier-free fragment Boolean algebra with Presburger arithmetic.
IntersectioNDE: Learning Complex Urban Traffic Dynamics based on Interaction Decoupling Strategy
Lin, Enli, Yang, Ziyuan, Lu, Qiujing, Hu, Jianming, Feng, Shuo
Realistic traffic simulation is critical for ensuring the safety and reliability of autonomous vehicles (AVs), especially in complex and diverse urban traffic environments. However, existing data-driven simulators face two key challenges: a limited focus on modeling dense, heterogeneous interactions at urban intersections - which are prevalent, crucial, and practically significant in countries like China, featuring diverse agents including motorized vehicles (MVs), non-motorized vehicles (NMVs), and pedestrians - and the inherent difficulty in robustly learning high-dimensional joint distributions for such high-density scenes, often leading to mode collapse and long-term simulation instability. We introduce City Crossings Dataset (CiCross), a large-scale dataset collected from a real-world urban intersection, uniquely capturing dense, heterogeneous multi-agent interactions, particularly with a substantial proportion of MVs, NMVs and pedestrians. Based on this dataset, we propose IntersectioNDE (Intersection Naturalistic Driving Environment), a data-driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), a training paradigm that learns compositional dynamics from agent subsets, enabling the marginal-to-joint simulation. Integrated into a scene-aware Transformer network with specialized training techniques, IDS significantly enhances simulation robustness and long-term stability for modeling heterogeneous interactions. Experiments on CiCross show that IntersectioNDE outperforms baseline methods in simulation fidelity, stability, and its ability to replicate complex, distribution-level urban traffic dynamics.
Extended Triangular Method: A Generalized Algorithm for Contradiction Separation Based Automated Deduction
Xu, Yang, Chen, Shuwei, Liu, Jun, Cao, Feng, He, Xingxing
Automated deduction lies at the core of Artificial Intelligence (AI), underpinning theorem proving, formal verification, and logical reasoning. Despite decades of progress, reconciling deductive completeness with computational efficiency remains an enduring challenge. Traditional reasoning calculi, grounded in binary resolution, restrict inference to pairwise clause interactions and thereby limit deductive synergy among multiple clauses. The Contradiction Separation Extension (CSE) framework, introduced in 2018, proposed a dynamic multi-clause reasoning theory that redefined logical inference as a process of contradiction separation rather than sequential resolution. While that work established the theoretical foundation, its algorithmic realization remained unformalized and unpublished. This work presents the Extended Triangular Method (ETM), a generalized contradiction-construction algorithm that formalizes and extends the internal mechanisms of contradiction separation. The ETM unifies multiple contradiction-building strategies, including the earlier Standard Extension method, within a triangular geometric framework that supports flexible clause interaction and dynamic synergy. ETM serves as the algorithmic core of several high-performance theorem provers, CSE, CSE-E, CSI-E, and CSI-Enig, whose competitive results in standard first-order benchmarks (TPTP problem sets and CASC 2018-2015) empirically validate the effectiveness and generality of the proposed approach. By bridging theoretical abstraction and operational implementation, ETM advances the contradiction separation paradigm into a generalized, scalable, and practically competitive model for automated reasoning, offering new directions for future research in logical inference and theorem proving.