Entropic Desired Dynamics for Intrinsic Control: Supplemental Material

Hansen, Steven

Neural Information Processing Systems

While this is not close to the state-of-the-art in general, Figure 2 shows the effect of action entropy on exploratory behavior in Montezuma's Revenge, measured as the number of unique avatar positions visited. Full training curves across all 6 Atari games are shown in Figure 1, including the random policy baseline; the full curves are included for completeness. At each state visited by the agent evaluator during training, the agent's state (consisting of the avatar's position) was recorded. The compute cluster we performed experiments on is heterogeneous, with features such as host-sharing and adaptive load-balancing, and we verified that this did not hamper performance.
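The coverage metric mentioned above (unique avatar positions visited) and the effect of action entropy are easy to reproduce in miniature. Below is a purely illustrative Python sketch, not the paper's environment or policy: a softmax policy over four moves on an infinite grid, with a temperature knob controlling action entropy. Every name and the hash-based per-state logits are invented for the example.

```python
import numpy as np

def unique_positions(temperature: float, steps: int = 5000, seed: int = 0) -> int:
    """Toy coverage metric: unique (x, y) cells visited by a softmax policy.

    Illustrative stand-in for the "number of unique avatar positions"
    metric; the grid world and the hash-based state logits are made up.
    """
    rng = np.random.default_rng(seed)
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    pos, visited = (0, 0), {(0, 0)}
    for _ in range(steps):
        # Fixed, state-dependent action preferences (deterministic hash).
        logits = np.random.default_rng(hash(pos) % 2**32).normal(size=4)
        logits = logits / temperature
        logits -= logits.max()              # numerical stability
        p = np.exp(logits)
        p /= p.sum()
        dx, dy = moves[rng.choice(4, p=p)]
        pos = (pos[0] + dx, pos[1] + dy)
        visited.add(pos)
    return len(visited)

for t in (0.05, 0.5, 5.0):
    print(f"temperature={t}: {unique_positions(t)} unique positions visited")
```

With a near-deterministic policy (low temperature) the walk quickly falls into a loop and coverage stalls, while higher action entropy keeps expanding the visited set, mirroring the qualitative effect the excerpt attributes to Figure 2.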


PoE-World: Compositional World Modeling with Products of Programmatic Experts

Piriyakulkij, Wasu Top, Liang, Yichao, Tang, Hao, Weller, Adrian, Kryven, Marta, Ellis, Kevin

arXiv.org Artificial Intelligence

Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.
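The core mechanism named in the abstract, an exponentially weighted product of experts, can be written down generically: each expert scores the candidate next states, and the composite model multiplies the expert probabilities with per-expert exponents before renormalizing. The sketch below is a generic PoE combination rule, not PoE-World's actual code; the two toy experts and their weights are invented.

```python
import numpy as np

def product_of_experts(expert_probs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Exponentially weighted product of expert distributions.

    expert_probs: (K, N) array; row k is expert k's distribution over N
    candidate next states. weights: (K,) non-negative expert weights.
    Returns p(s') proportional to prod_k p_k(s')**w_k, computed in log
    space for numerical stability. Generic PoE sketch, not PoE-World's code.
    """
    log_p = weights @ np.log(expert_probs + 1e-12)  # weighted sum of logs
    log_p -= log_p.max()                            # stabilize before exp
    p = np.exp(log_p)
    return p / p.sum()

# Two invented "programmatic experts" scoring three candidate next states:
gravity_expert   = np.array([0.7, 0.2, 0.1])  # e.g. "objects fall"
collision_expert = np.array([0.1, 0.1, 0.8])  # e.g. "walls block motion"
combined = product_of_experts(np.stack([gravity_expert, collision_expert]),
                              weights=np.array([1.0, 2.0]))
print(combined)  # the higher-weighted expert dominates the product
```

Working in log space turns the weighted product into a weighted sum of log-probabilities, which avoids underflow when many experts (or sharp distributions) are multiplied together.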



Parametrically Retargetable Decision-Makers Tend To Seek Power

Neural Information Processing Systems

In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive [Turner et al., 2021]. However, the real world is neither fully observable, nor must trained agents be even approximately reward-optimal.
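A toy Monte Carlo experiment makes the "most reward functions" claim concrete, under assumptions of our own choosing (iid uniform rewards, a three-outcome decision; this is an illustration, not the paper's formal result): shutting down fixes a single reward, whereas staying alive preserves two options and collects the better one.

```python
import numpy as np

# Toy illustration of retargetable power-seeking (not the paper's proof):
# from the start state, "shut down" yields a single terminal reward r_T,
# while "stay alive" keeps two options open and later collects
# max(r_C, r_D). For iid uniform rewards, staying alive is optimal with
# probability P(max(r_C, r_D) > r_T) = 2/3: most sampled reward functions
# favor the branch that preserves more options.
rng = np.random.default_rng(0)
r = rng.uniform(size=(100_000, 3))            # columns: r_T, r_C, r_D
stay_alive_optimal = np.maximum(r[:, 1], r[:, 2]) > r[:, 0]
print(stay_alive_optimal.mean())              # ~0.667, matching 2/3
```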




We thank the reviewers for thoroughly commenting on our article; their comments give us the opportunity to improve

Neural Information Processing Systems

For Montezuma's Revenge, the average prediction error remains high; in this case, the irrelevant intrinsic reward completely obscures the target goal. The less information is available about a transition, the more uncertain the model and the higher the error. R4: in general we cannot guarantee that the prediction error is a measure of uncertainty. For an intuition about the W-MSE representation and stochasticity, consider the noisy TV experiment: the environment contains a TV showing unpredictable noise, which a prediction-error-driven agent finds perpetually surprising. We evaluate in Atari and compare with the best-performing methods such as NGU. To show how the seed affects performance, we included Figure 1 with training dynamics in the supplementary material.
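The noisy-TV point, that prediction error keeps paying intrinsic reward for irreducible stochasticity, can be sketched with a running-mean predictor standing in for a learned forward model. Everything below (the two signals, the predictor) is an invented minimal example, not the authors' method.

```python
import numpy as np

def running_errors(samples: np.ndarray) -> np.ndarray:
    """Squared prediction errors of a running-mean predictor.

    The running mean stands in for a learned forward model; its squared
    error plays the role of the intrinsic reward in this toy setup.
    """
    pred, errs = 0.0, []
    for i, x in enumerate(samples, start=1):
        errs.append((x - pred) ** 2)   # intrinsic reward = prediction MSE
        pred += (x - pred) / i         # update prediction toward the mean
    return np.array(errs)

rng = np.random.default_rng(0)
deterministic = np.full(1000, 0.5)     # perfectly predictable channel
noisy_tv = rng.uniform(size=1000)      # irreducibly stochastic channel

print("deterministic, mean error over last 100 steps:",
      running_errors(deterministic)[-100:].mean())
print("noisy TV,      mean error over last 100 steps:",
      running_errors(noisy_tv)[-100:].mean())
```

The error on the predictable signal collapses to zero after the first step, while the noisy-TV error plateaus at the signal's variance (about 1/12 for uniform noise), so a prediction-error-driven agent would keep being rewarded for watching the TV.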