minigrid
Function
Algorithm 2 details the pseudocode for the partition function used in LaMCTS, which we use in LaP3 as well.

Algorithm 2 Partition Function
1: Input: Input space $\Omega$, samples $S_t$, node partition threshold $N_{thres}$, partitioning latent model $s(x)$
2: Set $V_0 = \{\Omega\}$
3: Set $V_{queue} = \{\Omega\}$
4: while $V_{queue} \neq \emptyset$ do
5:   $\Omega_p \leftarrow V_{queue}.\mathrm{pop}(0)$
[...]

It is clear that $F_k(y)$ is a monotonically decreasing function with $F_k(0) = 1$ and $\lim_{y \to +\infty} F_k(y) = 0$. Here we assume it is strictly decreasing so that $F_k(y)$ has a well-defined inverse function $F_k^{-1}$. In the following, we will omit the subscript $k$ for brevity.

[...] $P[f(x_i)\ [\ldots]\ g\ [\ldots]\ y \mid x_i \in \Omega_k]$ (4)
$= 1\ [\ldots]\ F_k^{n_t}(y)$ (5)

Note that 1 is due to the fact that all samples $x_1, \ldots, x_{n_t}$ are independently drawn within the region $\Omega_k$.
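The independence argument can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes that F(y) is the probability that a single sample drawn in the region lies more than y away from the region's best value, and it picks a hypothetical exponential gap distribution so that F(y) = exp(-y), which satisfies F(0) = 1 and F(y) → 0 as y → +∞. Under those assumptions, the chance that at least one of n independent samples falls within y is 1 - F(y)^n. The function names are made up for this example.

```python
import math
import random


def survival(y):
    """Hypothetical F(y): probability that one sample's gap exceeds y.

    Assumption for illustration: gaps are Exponential(1), so F(y) = exp(-y).
    """
    return math.exp(-y)


def prob_some_sample_within(y, n):
    """Closed form 1 - F(y)^n for n independently drawn samples."""
    return 1.0 - survival(y) ** n


def simulate(y, n, trials=20000, seed=0):
    """Monte Carlo estimate of P[min gap <= y] over n independent samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        gaps = [rng.expovariate(1.0) for _ in range(n)]
        if min(gaps) <= y:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    y, n = 0.1, 5
    print("closed form:", prob_some_sample_within(y, n))
    print("simulation: ", simulate(y, n))
```

The product form F(y)^n relies only on the samples being independent. A different choice of gap distribution changes F but not the structure of the calculation.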
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
Cho, Geonwoo, Im, Jaegyun, Lee, Jihwan, Yi, Hojun, Kim, Sejin, Kim, Sundong
Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/
Towards Monotonic Improvement in In-Context Reinforcement Learning
Zhang, Wenhao, Zhang, Shao, Wang, Xihuai, Li, Yang, Wen, Ying
In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming for continued improvement in test-time performance. However, our experimental analysis reveals a critical flaw: at test time, these models fail to show the continued improvement present in the training data. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model's own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve Contextual Ambiguity, we introduce Context Value into the training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL uses Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value can include more task-relevant information, and therefore the ideal performance should be non-decreasing. We prove that Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We further propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and Minigrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at https://github.com/Bluixe/towards_monotonic_improvement.
[Figure: panels labeled RND and BeBold at successive training checkpoints between 1.0M and 9.8M steps; image not recoverable.]
We provide final testing performance for NovelD and all baselines in MiniGrid. We also provide additional intrinsic analysis, similar to Sec. 4.2, in a seven-room environment in Figure 1. There are other categories of static environments, in which the initial positions of the agent and the goal can be randomized.