Planning behavior in a recurrent neural network that plays Sokoban

Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave

arXiv.org Artificial Intelligence 

In many tasks, the performance of both humans and some neural networks (NNs) improves with more reasoning: whether by giving a human time to think before making a chess move, or by prompting or training a large language model (LLM) to reason step by step [Kojima et al., 2022, OpenAI, 2024]. Among other reasoning capabilities, goal-oriented reasoning is particularly relevant to AI alignment. So-called "mesa-optimizers", AIs that have learned to pursue goals through internal reasoning [Hubinger et al., 2019], may internalize goals different from the training objective, leading to goal misgeneralization [Di Langosco et al., 2022, Shah et al., 2022]. Understanding how NNs learn to plan and how they represent the objective could be key to detecting, preventing, or correcting goal misgeneralization.

In this work, we focus on interpreting a Deep Repeating ConvLSTM [DRC; Guez et al., 2019] trained on Sokoban, a puzzle game often used as a planning benchmark [Peters et al., 2023]. We interpret the best network from Guez et al. [2019], DRC(3,3), which has 3 recurrent layers that are applied 3 times per environment step. Further details of the network are provided in Section 2. We find that its internal plan representation [Bush et al., 2025] is causal and improves with more computation, and that the DRC learns to take advantage of this by often "pacing" (taking extra steps) to gain enough time to refine its internal plan. We show similar results in Appendix B for another DRC network, and find a causal plan representation in a ResNet model.
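To make the DRC(3,3) architecture concrete, below is a minimal PyTorch sketch of a DRC-style network: 3 stacked ConvLSTM layers, each applied 3 times ("ticks") per environment step. The channel counts, kernel sizes, the simple conv encoder, and the layer-to-layer wiring are illustrative assumptions, not the exact hyperparameters of Guez et al. [2019]; the original DRC also includes details (e.g., pool-and-inject, skip connections) that are omitted here for brevity.

```python
# Hedged sketch of a DRC(depth=3, ticks=3)-style network in PyTorch.
# All sizes and wiring choices are assumptions for illustration only.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all four gates computed by one conv."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(
            in_channels + hidden_channels,
            4 * hidden_channels,
            kernel_size,
            padding=kernel_size // 2,
        )

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class DRC(nn.Module):
    """DRC(depth, ticks): `depth` stacked ConvLSTM layers, run `ticks`
    times on the same encoded observation at every environment step."""

    def __init__(self, in_channels: int, hidden_channels: int = 32,
                 depth: int = 3, ticks: int = 3):
        super().__init__()
        self.depth, self.ticks, self.hidden = depth, ticks, hidden_channels
        self.encoder = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.cells = nn.ModuleList(
            [ConvLSTMCell(hidden_channels, hidden_channels) for _ in range(depth)]
        )

    def init_state(self, batch: int, height: int, width: int):
        zeros = lambda: torch.zeros(batch, self.hidden, height, width)
        return [(zeros(), zeros()) for _ in range(self.depth)]

    def forward(self, obs, states):
        x = self.encoder(obs)
        # The same encoded observation is processed `ticks` times per
        # environment step; each extra tick gives the network more
        # computation with which to refine its internal plan.
        for _ in range(self.ticks):
            inp = x
            for d, cell in enumerate(self.cells):
                h, c = cell(inp, states[d])
                states[d] = (h, c)
                inp = h  # each layer feeds the one above it
        return inp, states  # top-layer hidden state would feed a policy head


if __name__ == "__main__":
    drc = DRC(in_channels=3)
    state = drc.init_state(batch=1, height=10, width=10)
    obs = torch.zeros(1, 3, 10, 10)  # placeholder Sokoban observation
    features, state = drc(obs, state)
    print(features.shape)  # torch.Size([1, 32, 10, 10])
```

Note that the inner `ticks` loop is the only source of extra per-step computation in this sketch; the "pacing" behavior described above amounts to the agent taking additional environment steps, each of which triggers further ticks on the recurrent state before the agent commits to a route.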
