minigrid
- Asia > Middle East > Jordan (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
Cho, Geonwoo, Im, Jaegyun, Lee, Jihwan, Yi, Hojun, Kim, Sejin, Kim, Sundong
Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/
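As a rough illustration of the scoring the abstract describes, the sketch below combines a value-loss regret term with a transition-prediction-error term and a Co-Learnability bonus to rank tasks in a replay-style buffer. The coefficients, field names, and the additive combination are illustrative assumptions, not TRACED's exact formulation.

```python
import numpy as np

def traced_score(value_loss, transition_pred_error, co_learnability,
                 alpha=1.0, beta=1.0, gamma=0.5):
    """Illustrative per-task learning-potential score.

    Regret is approximated by the value-function loss plus a
    transition-prediction-error term; the task's Co-Learnability
    (how much training on it helps other tasks) is added on top.
    Coefficients are assumed here, not taken from the paper.
    """
    regret_estimate = alpha * value_loss + beta * transition_pred_error
    return regret_estimate + gamma * co_learnability

# Rank a buffer of candidate environments for curriculum sampling.
tasks = [
    {"id": 0, "value_loss": 0.8, "tp_error": 0.3, "co_learn": 0.1},
    {"id": 1, "value_loss": 0.2, "tp_error": 0.9, "co_learn": 0.4},
]
scores = [traced_score(t["value_loss"], t["tp_error"], t["co_learn"])
          for t in tasks]
priority = np.argsort(scores)[::-1]  # highest learning potential first
```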
Towards Monotonic Improvement in In-Context Reinforcement Learning
Zhang, Wenhao, Zhang, Shao, Wang, Xihuai, Li, Yang, Wen, Ying
In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming for performance that continues to improve at test time. However, our experimental analysis reveals a critical flaw: at test time, these models fail to show the continual improvement present in the training data. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model's own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve Contextual Ambiguity, we introduce Context Value into the training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL uses Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value can incorporate more task-relevant information, so the ideal performance should be non-decreasing. We prove that Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We further propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and MiniGrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at https://github.com/Bluixe/towards_monotonic_improvement .
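To make the non-decreasing property of Context Value concrete, the sketch below computes one minimal estimator with that property: a running maximum over the returns observed so far in the context. This estimator and the function name are illustrative assumptions; the paper proposes its own two estimation methods.

```python
def context_value_targets(episode_returns):
    """Non-decreasing Context Value targets over successive episodes.

    The abstract requires the ideal achievable performance to be
    non-decreasing as the context grows; a running maximum of the
    returns seen so far is one minimal estimator with that property.
    (Illustrative choice, not necessarily the paper's estimator.)
    """
    targets, best = [], float("-inf")
    for r in episode_returns:
        best = max(best, r)
        targets.append(best)
    return targets

# e.g. returns from successive episodes in the context window
assert context_value_targets([0.2, 0.5, 0.4, 0.9]) == [0.2, 0.5, 0.5, 0.9]
```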
- South America > Suriname > Marowijne District > Albina (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
[Figure: per-environment steps to solve for RND and BeBold, ranging from 1.0M to 9.8M steps]
We provide the final test performance of NovelD and all baselines in MiniGrid, along with additional analysis of the intrinsic reward, similar to Sec. 4.2, in a seven-room environment in Figure 1. There are also other categories of static environments, in which the initial positions of the agent and the goal are randomized; a minimal sketch of such a setup follows.
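For concreteness, the "randomized agent and goal" category of static environments can be built in a few lines with the minigrid package. The class below is a generic sketch (assuming minigrid's current Gymnasium-based API), not the exact environment used in the cited analysis.

```python
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal
from minigrid.minigrid_env import MiniGridEnv

class RandomStartGoalEnv(MiniGridEnv):
    """Single-room grid where both agent and goal spawn at random cells."""

    def __init__(self, size=8, **kwargs):
        super().__init__(
            mission_space=MissionSpace(mission_func=lambda: "reach the goal"),
            grid_size=size,
            max_steps=4 * size * size,
            **kwargs,
        )

    def _gen_grid(self, width, height):
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)  # surrounding walls
        self.place_obj(Goal())   # goal at a random free cell
        self.place_agent()       # agent at a random free cell and orientation
        self.mission = "reach the goal"

# Usage: a fresh random layout is drawn on every reset.
env = RandomStartGoalEnv(size=8)
obs, info = env.reset()
```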