Planning behavior in a recurrent neural network that plays Sokoban

Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave

arXiv.org Artificial Intelligence 

In many tasks, the performance of both humans and some neural networks (NNs) improves with more reasoning: whether by giving a human time to think before making a chess move, or by prompting or training a large language model (LLM) to reason step by step [Kojima et al., 2022, OpenAI, 2024]. Among other reasoning capabilities, goal-oriented reasoning is particularly relevant to AI alignment. So-called "mesa-optimizers", AIs that have learned to pursue goals through internal reasoning [Hubinger et al., 2019], may internalize goals different from the training objective, leading to goal misgeneralization [Di Langosco et al., 2022, Shah et al., 2022]. Understanding how NNs learn to plan and how they represent the objective could be key to detecting, preventing, or correcting goal misgeneralization.

In this work, we focus on interpreting a Deep Repeating ConvLSTM [DRC; Guez et al., 2019] trained on Sokoban, a puzzle game often used as a planning benchmark [Peters et al., 2023]. We interpret the best network from Guez et al. [2019], DRC(3,3), which has 3 recurrent layers that are applied 3 times per environment step. Further details of the network are provided in Section 2. We find that its internal plan representation [Bush et al., 2025] is causal and improves with more computation, and that the DRC learns to take advantage of this by often "pacing" (taking extra steps) to gain enough time to refine its internal plan. We show similar results in Appendix B for another DRC network, and find a causal plan representation in a ResNet model.
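To make the DRC(3,3) architecture concrete, below is a minimal PyTorch sketch of a DRC-style network: 3 stacked ConvLSTM layers, each applied 3 times ("ticks") per environment step. The channel counts, kernel sizes, the simple conv encoder, and the layer-to-layer wiring are illustrative assumptions, not the exact hyperparameters of Guez et al. [2019]; the original DRC also includes details (e.g., pool-and-inject, skip connections) that are omitted here for brevity.

```python
# Hedged sketch of a DRC(depth=3, ticks=3)-style network in PyTorch.
# All sizes and wiring choices are assumptions for illustration only.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all four gates computed by one conv."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(
            in_channels + hidden_channels,
            4 * hidden_channels,
            kernel_size,
            padding=kernel_size // 2,
        )

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class DRC(nn.Module):
    """DRC(depth, ticks): `depth` stacked ConvLSTM layers, run `ticks`
    times on the same encoded observation at every environment step."""

    def __init__(self, in_channels: int, hidden_channels: int = 32,
                 depth: int = 3, ticks: int = 3):
        super().__init__()
        self.depth, self.ticks, self.hidden = depth, ticks, hidden_channels
        self.encoder = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.cells = nn.ModuleList(
            [ConvLSTMCell(hidden_channels, hidden_channels) for _ in range(depth)]
        )

    def init_state(self, batch: int, height: int, width: int):
        zeros = lambda: torch.zeros(batch, self.hidden, height, width)
        return [(zeros(), zeros()) for _ in range(self.depth)]

    def forward(self, obs, states):
        x = self.encoder(obs)
        # The same encoded observation is processed `ticks` times per
        # environment step; each extra tick gives the network more
        # computation with which to refine its internal plan.
        for _ in range(self.ticks):
            inp = x
            for d, cell in enumerate(self.cells):
                h, c = cell(inp, states[d])
                states[d] = (h, c)
                inp = h  # each layer feeds the one above it
        return inp, states  # top-layer hidden state would feed a policy head


if __name__ == "__main__":
    drc = DRC(in_channels=3)
    state = drc.init_state(batch=1, height=10, width=10)
    obs = torch.zeros(1, 3, 10, 10)  # placeholder Sokoban observation
    features, state = drc(obs, state)
    print(features.shape)  # torch.Size([1, 32, 10, 10])
```

Note that the inner `ticks` loop is the only source of extra per-step computation in this sketch; the "pacing" behavior described above amounts to the agent taking additional environment steps, each of which triggers further ticks on the recurrent state before the agent commits to a route.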
